The file ‘WeatherEDA_202005_v002.Rmd’ contains exploratory data analysis for historical weather data as contained in METAR archives hosted by Iowa State University.
Data have been dowloaded, processed, cleaned, and integrated for several stations (airports) and years, with .rds files saved in “./RInputFiles/ProcessedMETAR”.
This module will perform initial modeling on the processed weather files. It builds on the previous ‘WeatherModeling_202006_v001.Rmd’ and leverages functions in ‘WeatherModelingFunctions_v001.R’.
There are three main processed files available for further exploration:
metar_postEDA_20200617.rds and metar_postEDA_extra_20200627.rds
metar_modifiedClouds_20200617.rds and metar_modifiedclouds_extra_20200627.rds
metar_precipLists_20200617.rds and metar_precipLists_extra_20200627.rds
Several mapping files are defined for use in plotting; tidyverse, lubridate, and caret are loaded; and the relevant functions are sourced:
# The process frequently uses tidyverse, lubridate, caret, and randomForest
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# The main path for the files
filePath <- "./RInputFiles/ProcessedMETAR/"
# Sourcing functions
source("./WeatherModelingFunctions_v001.R")
# Descriptive names for key variables
varMapper <- c(source="Original source file name",
locale="Descriptive name",
dtime="Date-Time (UTC)",
origMETAR="Original METAR",
year="Year",
monthint="Month",
month="Month",
day="Day of Month",
WindDir="Wind Direction (degrees)",
WindSpeed="Wind Speed (kts)",
WindGust="Wind Gust (kts)",
predomDir="General Prevailing Wind Direction",
Visibility="Visibility (SM)",
Altimeter="Altimeter (inches Hg)",
TempF="Temperature (F)",
DewF="Dew Point (F)",
modSLP="Sea-Level Pressure (hPa)",
cType1="First Cloud Layer Type",
cLevel1="First Cloud Layer Height (ft)",
isRain="Rain at Observation Time",
isSnow="Snow at Observation Time",
isThunder="Thunder at Obsevation Time",
tempFHi="24-hour High Temperature (F)",
tempFLo="24-hour Low Temperature (F)",
minHeight="Minimum Cloud Height (ft)",
minType="Obscuration Level at Minimum Cloud Height",
ceilingHeight="Minimum Ceiling Height (ft)",
ceilingType="Obscuration Level at Minimum Ceiling Height",
hr="Hour of Day (Zulu time)",
hrfct="Hour of Day (Zulu time)",
hrBucket="Hour of Day (Zulu time) - rounded to nearest 3",
locNamefct="Locale Name"
)
# File name to city name mapper
cityNameMapper <- c(katl_2016="Atlanta, GA (2016)",
kbos_2016="Boston, MA (2016)",
kdca_2016="Washington, DC (2016)",
kden_2016="Denver, CO (2016)",
kdfw_2016="Dallas, TX (2016)",
kdtw_2016="Detroit, MI (2016)",
kewr_2016="Newark, NJ (2016)",
kgrb_2016="Green Bay, WI (2016)",
kgrr_2016="Grand Rapids, MI (2016)",
kiah_2016="Houston, TX (2016)",
kind_2016="Indianapolis, IN (2016)",
klas_2014="Las Vegas, NV (2014)",
klas_2015="Las Vegas, NV (2015)",
klas_2016="Las Vegas, NV (2016)",
klas_2017="Las Vegas, NV (2017)",
klas_2018="Las Vegas, NV (2018)",
klas_2019="Las Vegas, NV (2019)",
klax_2016="Los Angeles, CA (2016)",
klnk_2016="Lincoln, NE (2016)",
kmia_2016="Miami, FL (2016)",
kmke_2016="Milwaukee, WI (2016)",
kmsn_2016="Madison, WI (2016)",
kmsp_2016="Minneapolis, MN (2016)",
kmsy_2014="New Orleans, LA (2014)",
kmsy_2015="New Orleans, LA (2015)",
kmsy_2016="New Orleans, LA (2016)",
kmsy_2017="New Orleans, LA (2017)",
kmsy_2018="New Orleans, LA (2018)",
kmsy_2019="New Orleans, LA (2019)",
kord_2014="Chicago, IL (2014)",
kord_2015="Chicago, IL (2015)",
kord_2016="Chicago, IL (2016)",
kord_2017="Chicago, IL (2017)",
kord_2018="Chicago, IL (2018)",
kord_2019="Chicago, IL (2019)",
kphl_2016="Philadelphia, PA (2016)",
kphx_2016="Phoenix, AZ (2016)",
ksan_2014="San Diego, CA (2014)",
ksan_2015="San Diego, CA (2015)",
ksan_2016="San Diego, CA (2016)",
ksan_2017="San Diego, CA (2017)",
ksan_2018="San Diego, CA (2018)",
ksan_2019="San Diego, CA (2019)",
ksat_2016="San Antonio, TX (2016)",
ksea_2016="Seattle, WA (2016)",
ksfo_2016="San Francisco, CA (2016)",
ksjc_2016="San Jose, CA (2016)",
kstl_2016="Saint Louis, MO (2016)",
ktpa_2016="Tampa Bay, FL (2016)",
ktvc_2016="Traverse City, MI (2016)"
)
# File names in 2016, based on cityNameMapper
names_2016 <- grep(names(cityNameMapper), pattern="[a-z]{3}_2016", value=TRUE)
The main data will be from the metar_postEDA files. They are integrated below, and cloud and ceiling heights are converted to factors:
# Main weather data
metarData <- readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_20200617.rds") %>%
bind_rows(readRDS("./RInputFiles/ProcessedMETAR/metar_postEDA_extra_20200627.rds")) %>%
mutate(orig_minHeight=minHeight,
orig_ceilingHeight=ceilingHeight,
minHeight=mapCloudHeight(minHeight),
ceilingHeight=mapCloudHeight(ceilingHeight)
)
glimpse(metarData)
## Observations: 437,417
## Variables: 43
## $ source <chr> "kdtw_2016", "kdtw_2016", "kdtw_2016", "kdtw_201...
## $ locale <chr> "Detroit, MI (2016)", "Detroit, MI (2016)", "Det...
## $ dtime <dttm> 2016-01-01 00:53:00, 2016-01-01 01:53:00, 2016-...
## $ origMETAR <chr> "KDTW 010053Z 23012KT 10SM OVC020 00/M05 A3021 R...
## $ year <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, ...
## $ monthint <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ month <fct> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ WindDir <chr> "230", "230", "230", "240", "230", "220", "220",...
## $ WindSpeed <int> 12, 12, 11, 14, 16, 13, 14, 16, 13, 16, 17, 13, ...
## $ WindGust <dbl> NA, NA, NA, 23, 22, NA, 20, 20, NA, 22, NA, NA, ...
## $ predomDir <fct> SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, SW, ...
## $ Visibility <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 8, 5, 7, 8, 10, ...
## $ Altimeter <dbl> 30.21, 30.21, 30.19, 30.19, 30.18, 30.16, 30.14,...
## $ TempF <dbl> 32.00, 32.00, 32.00, 30.92, 30.92, 32.00, 30.92,...
## $ DewF <dbl> 23.00, 21.92, 21.02, 19.94, 19.94, 19.94, 19.94,...
## $ modSLP <dbl> 1023.6, 1023.5, 1023.0, 1023.0, 1022.7, 1022.0, ...
## $ cType1 <chr> "OVC", "OVC", "OVC", "OVC", "OVC", "OVC", "OVC",...
## $ cType2 <chr> "", "", "", "", "", "", "", "", "", "OVC", "OVC"...
## $ cType3 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType4 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType5 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cType6 <chr> "", "", "", "", "", "", "", "", "", "", "", "", ...
## $ cLevel1 <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ cLevel2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 4500, 4000, ...
## $ cLevel3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cLevel6 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ isRain <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isSnow <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ isThunder <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ p1Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, N...
## $ p36Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, 0, NA...
## $ p24Inches <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFHi <dbl> NA, NA, NA, NA, 36, NA, NA, NA, NA, NA, NA, NA, ...
## $ tempFLo <dbl> NA, NA, NA, NA, 31, NA, NA, NA, NA, NA, NA, NA, ...
## $ minHeight <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ minType <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ ceilingHeight <fct> Low, Low, Low, Low, Low, Low, Low, Low, Low, Low...
## $ ceilingType <fct> OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, OVC, BKN...
## $ orig_minHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
## $ orig_ceilingHeight <dbl> 2000, 2000, 2200, 2500, 2700, 2500, 2500, 2500, ...
An initial exploration of the data can use hierarchical clustering based on several numerical features of the data (temperature, dew point, sea-level pressure, wind speed) by month.
Example code includes:
# Create hierarchical clustering for 2016 data
# Distance is calculated only where month is the same
distData <- metarData %>%
filter(year==2016) %>%
select(locale, month, TempF, DewF, modSLP, WindSpeed) %>%
group_by(locale, month) %>%
summarize_if(is.numeric, mean, na.rm=TRUE) %>%
ungroup() %>%
mutate(rowname=paste0(locale, "_", month)) %>%
column_to_rownames() %>%
select(-locale, -month) %>%
as.matrix() %>%
scale() %>%
dist() %>%
as.matrix() %>%
as.data.frame() %>%
rownames_to_column(var="locale1") %>%
pivot_longer(-locale1, names_to="locale2", values_to="dist") %>%
mutate(city1=str_replace(locale1, "_\\w{3}", ""),
city2=str_replace(locale2, "_\\w{3}", ""),
month1=str_sub(locale1, -3),
month2=str_sub(locale2, -3)) %>%
filter(month1==month2) %>%
group_by(city1, city2) %>%
summarize(meandist=mean(dist), sddist=sd(dist)) %>%
ungroup() %>%
select(-sddist) %>%
pivot_wider(city1, names_from="city2", values_from="meandist") %>%
column_to_rownames(var="city1") %>%
as.matrix() %>%
as.dist(diag=FALSE)
distData %>%
hclust(method="complete") %>%
plot()
distData %>%
hclust(method="single") %>%
plot()
At a glance, there are several sensible findings from the clustering:
There are a handful of cities that do not seem to cluster with anything else:
Suppose that the only information available about two cities were their temperatures and dew points, with month also included. How well would a basic random forest, with mtry=3, classify the cities?
The function is then run for every combination of locales from 2016 in cityNameMapper. A common random seed is applied to every run of the process:
# Create a container list to hold the output
list_2016_TempF_DewF_month <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))
set.seed(2006281443)
n <- 1
for (ctr in 1:(length(names_2016)-1)) {
for (ctr2 in (ctr+1):length(names_2016)) {
list_2016_TempF_DewF_month[[n]] <- rfTwoLocales(metarData,
loc1=names_2016[ctr],
loc2=names_2016[ctr2],
vrbls=c("TempF", "DewF", "month"),
ntree=25
)
n <- n + 1
if ((n %% 50) == 0) { cat("Through number:", n, "\n")}
}
}
## Through number: 50
## Through number: 100
## Through number: 150
## Through number: 200
## Through number: 250
## Through number: 300
## Through number: 350
## Through number: 400
Accuracy can then be assessed:
# Create a tibble from the underlying accuracy data
acc_TempF_DewF_month <- map_dfr(list_2016_TempF_DewF_month, .f=helperAccuracyLocale)
# Assess the top 10 classification accuracies
acc_TempF_DewF_month %>%
arrange(-accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Denver, CO (2016) Miami, FL (2016) 0.997 0.997 0.997
## 2 Las Vegas, NV (2016) Miami, FL (2016) 0.990 0.991 0.989
## 3 Miami, FL (2016) Seattle, WA (2016) 0.985 0.985 0.985
## 4 Denver, CO (2016) Tampa Bay, FL (2016) 0.984 0.986 0.982
## 5 Miami, FL (2016) Traverse City, MI (201~ 0.980 0.981 0.980
## 6 Miami, FL (2016) Minneapolis, MN (2016) 0.978 0.978 0.978
## 7 Boston, MA (2016) Miami, FL (2016) 0.976 0.975 0.976
## 8 Denver, CO (2016) New Orleans, LA (2016) 0.975 0.975 0.975
## 9 Miami, FL (2016) Phoenix, AZ (2016) 0.974 0.973 0.975
## 10 Green Bay, WI (2016) Miami, FL (2016) 0.971 0.972 0.971
# Assess the bottom 10 classification accuracies
acc_TempF_DewF_month %>%
arrange(accOverall) %>%
head(10)
## # A tibble: 10 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Newark, NJ (2016) Philadelphia, PA (20~ 0.580 0.591 0.570
## 2 Chicago, IL (2016) Milwaukee, WI (2016) 0.591 0.600 0.583
## 3 Chicago, IL (2016) Madison, WI (2016) 0.598 0.622 0.573
## 4 Detroit, MI (2016) Grand Rapids, MI (20~ 0.608 0.568 0.649
## 5 Chicago, IL (2016) Detroit, MI (2016) 0.610 0.633 0.587
## 6 Grand Rapids, MI (201~ Traverse City, MI (2~ 0.613 0.630 0.597
## 7 Madison, WI (2016) Milwaukee, WI (2016) 0.618 0.625 0.611
## 8 Green Bay, WI (2016) Madison, WI (2016) 0.619 0.586 0.652
## 9 Detroit, MI (2016) Madison, WI (2016) 0.619 0.639 0.598
## 10 Chicago, IL (2016) Grand Rapids, MI (20~ 0.621 0.647 0.595
allAccuracy_month <- select(acc_TempF_DewF_month,
locale=locale1,
other=locale2,
accOverall,
accLocale=accLocale1
) %>%
bind_rows(select(acc_TempF_DewF_month,
locale=locale2,
other=locale1,
accOverall,
accLocale=accLocale2
)
)
# Overall accuracy by location plot
allAccuracy_month %>%
group_by(locale) %>%
summarize_if(is.numeric, mean) %>%
ggplot(aes(x=fct_reorder(locale, accOverall), y=accOverall)) +
geom_point(size=2) +
geom_text(aes(y=accOverall+0.02, label=paste0(round(100*accOverall), "%"))) +
coord_flip() +
labs(x="",
y="Average Accuracy",
title="Average Accuracy Predicting Locale",
subtitle="Predictions made 1:1 to each other locale (average accuracy reported)",
caption="Temperature, Dew Point, Month as predictors\n(50% is baseline coinflip accuracy)"
) +
ylim(c(0.5, 1))
# Overall accuracy heatmap (FIX for better ordering of locales)
# allAccuracy_month %>%
# ggplot(aes(x=locale, y=other)) +
# geom_tile(aes(fill=accOverall)) +
# theme(axis.text.x=element_text(angle=90)) +
# scale_fill_continuous("Accuracy", high="darkblue", low="white") +
# labs(title="Accuracy Predicting Locale vs. Locale",
# caption="Temperature, Dew Point, and Month as predictors\n(50% is baseline coinflip accuracy)",
# x="",
# y=""
# )
Accuracy is best for cities with significantly different locales. The model is especially successful at pulling apart the desert cities (Las Vegas, Phoenix) from everything else, and the humid Florida cities (Miami, Tampa Bay) from everything else.
Next, the simple model is run to classify locale across the full 2016 dataset:
# Run random forest for all 2016 locales
rf_all_2016_TDm <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF", "month"),
locs=names_2016,
ntree=50,
seed=2006281508
)
Summaries can then be created for the accuracy in predicting each locale:
evalPredictions(rf_all_2016_TDm, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")
## # A tibble: 898 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 484 0.182
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 49 0.0185
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 39 0.0147
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 179 0.0675
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 34 0.0128
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 42 0.0158
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 48 0.0181
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 42 0.0158
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 96 0.0362
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 49 0.0185
## # ... with 888 more rows
Accuracy is meaningfully increased compared to the null accuracy, though there is still under 50% accuracy in classifying any locale. This is suggestive that the goal will be to define some archetype climates, and to use those for predicting (e.g., humid, cold, desert, etc.).
Clouds (minimum levels and ceiling heights) can potentially help further differentiate the cold weather cities on the downwind side of the lake, as well as helping to further pull apart various marine and mid-continental climates.
An updated random forest model is then run:
# Run random forest for all 2016 locales
rf_all_2016_TDmc <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF", "month", "minHeight", "ceilingHeight"),
locs=names_2016,
ntree=50,
seed=2006281509
)
The evaluation process is converted to a function:
evalPredictions(rf_all_2016_TDmc, plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height")
## # A tibble: 897 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 643 0.240
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 84 0.0313
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 61 0.0227
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 145 0.0541
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 44 0.0164
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 51 0.0190
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 49 0.0183
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 26 0.00969
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 82 0.0306
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 100 0.0373
## # ... with 887 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDm, currObject=rf_all_2016_TDmc)
Accuracy increases meaningfully, though still well under 50% in most cases.
The model can also be built out to consider wind speed and wind direction. No attempt yet is made to control for over-fitting:
# Run random forest for all 2016 locales
rf_all_2016_TDmcw <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=names_2016,
ntree=50,
seed=2006281934
)
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcw,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction"
)
## # A tibble: 889 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1096 0.415
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 34 0.0129
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 46 0.0174
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 135 0.0511
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 28 0.0106
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 31 0.0117
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 26 0.00984
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 24 0.00908
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 58 0.0219
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 59 0.0223
## # ... with 879 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmc, currObject=rf_all_2016_TDmcw)
Including wind significantly improves model accuracy for many locales. Even the cold weather cities are now being predicted with around 30%-40% accucacy.
The model can also be built out to consider hour of day and sea-level pressure. No attempt yet is made to control for over-fitting, with the exception that mtry is held at 4:
# Run random forest for all 2016 locales
rf_all_2016_TDmcwa <- rfMultiLocale(metarData,
vrbls=c("TempF", "DewF",
"month",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=names_2016,
ntree=50,
mtry=4,
seed=2006282103
)
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_all_2016_TDmcwa,
plotCaption = "Temp, Dew Point, Month, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 880 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1639 0.625
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 16 0.00610
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 23 0.00877
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 75 0.0286
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 8 0.00305
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 18 0.00686
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 12 0.00457
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 6 0.00229
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 45 0.0172
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 51 0.0194
## # ... with 870 more rows
# Get the comparison to previous accuracy
deltaAccuracy(prevObject=rf_all_2016_TDmcw, currObject=rf_all_2016_TDmcwa)
Adding sea-level pressure again significantly improves prediction ability. The cold weather cities are in the roughly 40%-60% accuracy range, and the other cities are in the roughly 60%-80% accuracy range.
Based on the hierarchical clustering and results of the initial random forest models, a few archetype cities are defined in an attempt to pull out distinctions:
A data subset is created, with “hr” added for the Zulu hour of the observation (as integer rather than as factor):
archeCities <- c("kmia_2016", "klas_2016", "klax_2016", "kden_2016",
"ksea_2016", "kdfw_2016", "kmsp_2016", "katl_2016"
)
archeData <- metarData %>%
filter(source %in% archeCities) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all variables from previous, as well as hour:
# Run random forest for all 2016 locales
rf_arche_2016_TDmcwa <- rfMultiLocale(archeData,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291353
)
##
## Running for locations:
## [1] "katl_2016" "kden_2016" "kdfw_2016" "klas_2016" "klax_2016" "kmia_2016"
## [7] "kmsp_2016" "ksea_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 62 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 2204 0.835
## 2 Atlanta, GA (2016) Dallas, TX (2016) FALSE 139 0.0527
## 3 Atlanta, GA (2016) Denver, CO (2016) FALSE 39 0.0148
## 4 Atlanta, GA (2016) Las Vegas, NV (2016) FALSE 33 0.0125
## 5 Atlanta, GA (2016) Los Angeles, CA (2016) FALSE 86 0.0326
## 6 Atlanta, GA (2016) Miami, FL (2016) FALSE 50 0.0189
## 7 Atlanta, GA (2016) Minneapolis, MN (2016) FALSE 45 0.0170
## 8 Atlanta, GA (2016) Seattle, WA (2016) FALSE 44 0.0167
## 9 Dallas, TX (2016) Atlanta, GA (2016) FALSE 203 0.0789
## 10 Dallas, TX (2016) Dallas, TX (2016) TRUE 2105 0.818
## # ... with 52 more rows
The model is reasonably successful in pulling apart the archetypes. Denver and Dallas seem potentially less distinct, though all archetypes cities are predicted at 80%+ accuracy.
The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa,
fullData=metarData,
archeCities=archeCities,
sortDescMatch=TRUE
)
There appear to be several issues:
The first issue is addressed by collapsing Atlanta, Dallas, and Miami and instead using Houston as the archetype; and replacing Minneapolis with Chicago:
archeCities_v02 <- c("kiah_2016", "klas_2016", "klax_2016",
"kden_2016", "ksea_2016", "kord_2016"
)
archeData_v02 <- metarData %>%
filter(source %in% archeCities_v02) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all variables from previous, as well as hour:
# Run random forest for all 2016 locales in archeData_v02
rf_arche_2016_TDmcwa_v02 <- rfMultiLocale(archeData_v02,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291448
)
##
## Running for locations:
## [1] "kden_2016" "kiah_2016" "klas_2016" "klax_2016" "kord_2016" "ksea_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v02,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL (2016) Chicago, IL (2016) TRUE 2321 0.872
## 2 Chicago, IL (2016) Denver, CO (2016) FALSE 99 0.0372
## 3 Chicago, IL (2016) Houston, TX (2016) FALSE 54 0.0203
## 4 Chicago, IL (2016) Las Vegas, NV (2016) FALSE 13 0.00489
## 5 Chicago, IL (2016) Los Angeles, CA (2016) FALSE 70 0.0263
## 6 Chicago, IL (2016) Seattle, WA (2016) FALSE 104 0.0391
## 7 Denver, CO (2016) Chicago, IL (2016) FALSE 119 0.0453
## 8 Denver, CO (2016) Denver, CO (2016) TRUE 2282 0.869
## 9 Denver, CO (2016) Houston, TX (2016) FALSE 3 0.00114
## 10 Denver, CO (2016) Las Vegas, NV (2016) FALSE 126 0.0480
## # ... with 26 more rows
The model is reasonably successful in pulling apart the archetypes. Houston appears the most distinct (as had Miami previously), and there is less overlap going to/from Denver or Seattle.
The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa_v02,
fullData=metarData,
archeCities=archeCities_v02,
sortDescMatch = TRUE
)
At a glance, the archetypes are more sensible. There is a bit of Denver and a bit of Seattle in most of the cold weather cities, and a bit of Houston in some of the warmer cold weather cities. This is suggestive that there are either more then one type of cold-weather cities, or that cold-weather cities tend to be like other archetypes are certain times.
The model is run again, deleting Denver and Seattle, and converting Los Angeles to San Diego:
archeCities_v03 <- c("kiah_2016", "klas_2016", "ksan_2016", "kord_2016")
archeData_v03 <- metarData %>%
filter(source %in% archeCities_v03) %>%
mutate(hr=lubridate::hour(dtime))
A random forest, with mtry=4, is then run using all variables from previous, as well as hour:
# Run random forest for all 2016 locales in archeData_v03
rf_arche_2016_TDmcwa_v03 <- rfMultiLocale(archeData_v03,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
mtry=4,
seed=2006291459
)
##
## Running for locations:
## [1] "kiah_2016" "klas_2016" "kord_2016" "ksan_2016"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_2016_TDmcwa_v03,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP"
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL (2016) Chicago, IL (2016) TRUE 2511 0.946
## 2 Chicago, IL (2016) Houston, TX (2016) FALSE 69 0.0260
## 3 Chicago, IL (2016) Las Vegas, NV (2016) FALSE 26 0.00979
## 4 Chicago, IL (2016) San Diego, CA (2016) FALSE 49 0.0185
## 5 Houston, TX (2016) Chicago, IL (2016) FALSE 44 0.0170
## 6 Houston, TX (2016) Houston, TX (2016) TRUE 2473 0.954
## 7 Houston, TX (2016) Las Vegas, NV (2016) FALSE 19 0.00733
## 8 Houston, TX (2016) San Diego, CA (2016) FALSE 57 0.0220
## 9 Las Vegas, NV (2016) Chicago, IL (2016) FALSE 29 0.0113
## 10 Las Vegas, NV (2016) Houston, TX (2016) FALSE 13 0.00508
## 11 Las Vegas, NV (2016) Las Vegas, NV (2016) TRUE 2482 0.971
## 12 Las Vegas, NV (2016) San Diego, CA (2016) FALSE 33 0.0129
## 13 San Diego, CA (2016) Chicago, IL (2016) FALSE 35 0.0130
## 14 San Diego, CA (2016) Houston, TX (2016) FALSE 29 0.0108
## 15 San Diego, CA (2016) Las Vegas, NV (2016) FALSE 63 0.0235
## 16 San Diego, CA (2016) San Diego, CA (2016) TRUE 2559 0.953
These archetypes are distinct, with accuracies all in the 95% range. The remaining cities can be classified against the archetype data, to explore any patterns that emerge:
localeByArchetype(rf_arche_2016_TDmcwa_v03,
fullData=metarData,
archeCities=archeCities_v03,
sortDescMatch=TRUE
)
Quite a few cities show as blends of the archetype cities, which is likely correct if archetypes are considered to be Cold, Humid, Marine, Desert. An open question is whether to expand that list with another 1-2 archetypes that attempt to better capture Atlanta, Dallas, Denver, St Louis, and DC.
The Marine archetype may need to be converted to Pacific and to better identify the California entities.
A decision about Seattle is needed, as it is a little like everything and everything is a little like it.
To address the issue of the model learning from specific cities rather than archetypes, user-defined clusters will be created, and data from these clusters will be used in modeling:
Further, a variable will be created for “hr”, the Zulu hour of the observation:
# Create the locale mapper
locMapperTibble <- tibble::tribble(
~source, ~locType,
'katl_2016', 'South',
'kbos_2016', 'Exclude',
'kdca_2016', 'Exclude',
'kden_2016', 'Exclude',
'kdfw_2016', 'South',
'kdtw_2016', 'Cold',
'kewr_2016', 'Exclude',
'kgrb_2016', 'Cold',
'kgrr_2016', 'Cold',
'kiah_2016', 'Humid',
'kind_2016', 'Mixed',
'klas_2016', 'Desert',
'klax_2016', 'Marine',
'klnk_2016', 'Mixed',
'kmia_2016', 'Humid',
'kmke_2016', 'Cold',
'kmsn_2016', 'Cold',
'kmsp_2016', 'Cold',
'kmsy_2016', 'Humid',
'kord_2016', 'Cold',
'kphl_2016', 'Exclude',
'kphx_2016', 'Desert',
'ksan_2016', 'Marine',
'ksat_2016', 'South',
'ksea_2016', 'Exclude',
'ksfo_2016', 'Exclude',
'ksjc_2016', 'Exclude',
'kstl_2016', 'Mixed',
'ktpa_2016', 'Humid',
'ktvc_2016', 'Cold'
)
# Create locMapper
locMapper <- locMapperTibble %>% pull(locType)
names(locMapper) <- locMapperTibble %>% pull(source)
locMapper
## katl_2016 kbos_2016 kdca_2016 kden_2016 kdfw_2016 kdtw_2016 kewr_2016 kgrb_2016
## "South" "Exclude" "Exclude" "Exclude" "South" "Cold" "Exclude" "Cold"
## kgrr_2016 kiah_2016 kind_2016 klas_2016 klax_2016 klnk_2016 kmia_2016 kmke_2016
## "Cold" "Humid" "Mixed" "Desert" "Marine" "Mixed" "Humid" "Cold"
## kmsn_2016 kmsp_2016 kmsy_2016 kord_2016 kphl_2016 kphx_2016 ksan_2016 ksat_2016
## "Cold" "Cold" "Humid" "Cold" "Exclude" "Desert" "Marine" "South"
## ksea_2016 ksfo_2016 ksjc_2016 kstl_2016 ktpa_2016 ktvc_2016
## "Exclude" "Exclude" "Exclude" "Mixed" "Humid" "Cold"
# Create the data file with locType
locData_v04 <- metarData %>%
mutate(locType=locMapper[source], hr=lubridate::hour(dtime)) %>%
filter(year==2016, locType !="Exclude")
# Counts by locType
locData_v04 %>%
count(locType)
## # A tibble: 6 x 2
## locType n
## <chr> <int>
## 1 Cold 70101
## 2 Desert 17543
## 3 Humid 35074
## 4 Marine 17536
## 5 Mixed 26187
## 6 South 26309
There is a risk that the model will predict neutral cities as ‘Cold’ given the 4:1 ratio of Cold cities to Marine/Desert cities. The random forest model is first run on the data ‘as is’:
rf_arche_small_TDmcwa_v04 <- rfMultiLocale(locData_v04,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
locVar="locType",
pred="locType",
mtry=4,
otherVar=c("dtime", "source", "locale"),
seed=2006301313
)
##
## Running for locations:
## [1] "Cold" "Desert" "Humid" "Marine" "Mixed" "South"
Prediction accuracy is then assessed:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_small_TDmcwa_v04,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP",
keyVar="locType"
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Cold Cold TRUE 19819 0.947
## 2 Cold Desert FALSE 37 0.00177
## 3 Cold Humid FALSE 119 0.00568
## 4 Cold Marine FALSE 159 0.00759
## 5 Cold Mixed FALSE 549 0.0262
## 6 Cold South FALSE 254 0.0121
## 7 Desert Cold FALSE 122 0.0234
## 8 Desert Desert TRUE 4790 0.921
## 9 Desert Humid FALSE 41 0.00788
## 10 Desert Marine FALSE 115 0.0221
## # ... with 26 more rows
The designation of Cold, Desert, Humid, and Marine seems successful. The attempt to split South from Humid resulted in good, if incomplete, separation. The attempt to separate a group of ‘warmer’ cold-weather cities from ‘colder’ cold-weather cities was not particularly successful. This could partially be an artifact of ‘Cold’ having double the data volume as ‘Mixed’.
Classifications by city can also be examined:
archeCities_v04 <- locMapperTibble %>%
filter(locType != "Exclude") %>%
pull(source)
localeByArchetype(rf_arche_small_TDmcwa_v04,
fullData=mutate(metarData, locType=locMapper[source]),
archeCities=archeCities_v04,
sortDescMatch=TRUE
)
Many of the cities are nicely classified in to their assigned archetypes, as desired. Among the cities used in the classifications, concerns include:
Classifications are changed somewhat, and data are then filtered so that an equal number of observations from each locale type are applied in the modeling:
# Create the locale mapper
locMapperTibble_v05 <- tibble::tribble(
~source, ~locType,
'katl_2016', 'Exclude',
'kbos_2016', 'Exclude',
'kdca_2016', 'Exclude',
'kden_2016', 'Exclude',
'kdfw_2016', 'Exclude',
'kdtw_2016', 'Cold',
'kewr_2016', 'Exclude',
'kgrb_2016', 'Cold',
'kgrr_2016', 'Cold',
'kiah_2016', 'Humid',
'kind_2016', 'Exclude',
'klas_2016', 'Desert',
'klax_2016', 'Marine',
'klnk_2016', 'Exclude',
'kmia_2016', 'Humid',
'kmke_2016', 'Cold',
'kmsn_2016', 'Cold',
'kmsp_2016', 'Cold',
'kmsy_2016', 'Humid',
'kord_2016', 'Cold',
'kphl_2016', 'Exclude',
'kphx_2016', 'Desert',
'ksan_2016', 'Marine',
'ksat_2016', 'Exclude',
'ksea_2016', 'Exclude',
'ksfo_2016', 'Exclude',
'ksjc_2016', 'Exclude',
'kstl_2016', 'Exclude',
'ktpa_2016', 'Humid',
'ktvc_2016', 'Cold'
)
# Create locMapper
locMapper_v05 <- locMapperTibble_v05 %>% pull(locType)
names(locMapper_v05) <- locMapperTibble_v05 %>% pull(source)
locMapper_v05
## katl_2016 kbos_2016 kdca_2016 kden_2016 kdfw_2016 kdtw_2016 kewr_2016 kgrb_2016
## "Exclude" "Exclude" "Exclude" "Exclude" "Exclude" "Cold" "Exclude" "Cold"
## kgrr_2016 kiah_2016 kind_2016 klas_2016 klax_2016 klnk_2016 kmia_2016 kmke_2016
## "Cold" "Humid" "Exclude" "Desert" "Marine" "Exclude" "Humid" "Cold"
## kmsn_2016 kmsp_2016 kmsy_2016 kord_2016 kphl_2016 kphx_2016 ksan_2016 ksat_2016
## "Cold" "Cold" "Humid" "Cold" "Exclude" "Desert" "Marine" "Exclude"
## ksea_2016 ksfo_2016 ksjc_2016 kstl_2016 ktpa_2016 ktvc_2016
## "Exclude" "Exclude" "Exclude" "Exclude" "Humid" "Cold"
# Create the data file with locType
locData_v05 <- metarData %>%
mutate(locType=locMapper_v05[source], hr=lubridate::hour(dtime)) %>%
filter(year==2016, locType !="Exclude")
# Counts by locType
locData_v05 %>%
count(locType)
## # A tibble: 4 x 2
## locType n
## <chr> <int>
## 1 Cold 70101
## 2 Desert 17543
## 3 Humid 35074
## 4 Marine 17536
The data are the filtered so that there are an equal number of observations from each locale type. The model can later be predicted against the remaining items:
# Set a seed for reporducibility
set.seed(2006301356)
# Find the smallest locale type
nSmall <- locData_v05 %>%
filter(!is.na(locType)) %>%
count(locType) %>%
pull(n) %>%
min()
# Create the relevant data subset
subData_v05 <- locData_v05 %>%
filter(!is.na(locType)) %>%
group_by(locType) %>%
sample_n(size=nSmall, replace=FALSE) %>%
ungroup()
# Sumarize the data subset
subData_v05 %>%
count(locType, locale) %>%
arrange(-n)
## # A tibble: 16 x 3
## locType locale n
## <chr> <chr> <int>
## 1 Marine Los Angeles, CA (2016) 8774
## 2 Desert Phoenix, AZ (2016) 8770
## 3 Desert Las Vegas, NV (2016) 8766
## 4 Marine San Diego, CA (2016) 8762
## 5 Humid New Orleans, LA (2016) 4402
## 6 Humid Miami, FL (2016) 4388
## 7 Humid Houston, TX (2016) 4386
## 8 Humid Tampa Bay, FL (2016) 4360
## 9 Cold Grand Rapids, MI (2016) 2234
## 10 Cold Milwaukee, WI (2016) 2231
## 11 Cold Minneapolis, MN (2016) 2205
## 12 Cold Madison, WI (2016) 2188
## 13 Cold Traverse City, MI (2016) 2182
## 14 Cold Detroit, MI (2016) 2173
## 15 Cold Chicago, IL (2016) 2168
## 16 Cold Green Bay, WI (2016) 2155
The previous model can then be run on the data subset:
rf_arche_small_TDmcwa_v05 <- rfMultiLocale(subData_v05,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL, # defaults to running for all locales
ntree=100,
locVar="locType",
pred="locType",
mtry=4,
otherVar=c("dtime", "source", "locale"),
seed=2006301358
)
##
## Running for locations:
## [1] "Cold" "Desert" "Humid" "Marine"
The evaluation process can again be run:
# Get the graphs of the prediction accuracy
evalPredictions(rf_arche_small_TDmcwa_v05,
plotCaption = "Temp, Dew Point, Month/Hour, Cloud/Ceiling Height, Wind Speed/Direction, SLP",
keyVar="locType"
)
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Cold Cold TRUE 4864 0.931
## 2 Cold Desert FALSE 68 0.0130
## 3 Cold Humid FALSE 116 0.0222
## 4 Cold Marine FALSE 175 0.0335
## 5 Desert Cold FALSE 58 0.0112
## 6 Desert Desert TRUE 4925 0.955
## 7 Desert Humid FALSE 34 0.00659
## 8 Desert Marine FALSE 141 0.0273
## 9 Humid Cold FALSE 99 0.0188
## 10 Humid Desert FALSE 64 0.0121
## 11 Humid Humid TRUE 4916 0.932
## 12 Humid Marine FALSE 194 0.0368
## 13 Marine Cold FALSE 88 0.0166
## 14 Marine Desert FALSE 210 0.0396
## 15 Marine Humid FALSE 81 0.0153
## 16 Marine Marine TRUE 4920 0.928
The archetypes are well-separated, with roughly 95% accuracy in every case. Predictions can then also be made for all cities, including those not in the modeling:
archeCities_v05 <- locMapperTibble_v05 %>%
filter(locType != "Exclude") %>%
pull(source)
localeByArchetype(rf_arche_small_TDmcwa_v05,
fullData=mutate(metarData, locType=locMapper_v05[source]),
archeCities=archeCities_v05,
sortDescMatch=TRUE
)
Broadly speaking, cities used in modeling appear to classify in to the appropriate archetypes. For the cities not included in the modeling:
The model can be applied to data from years other than 2016, to see how the classifications are impacted by use in out-of-sample years:
# Predictions on non-2016 data
helperPredictPlot(rf_arche_small_TDmcwa_v05$rfModel,
df=filter(mutate(metarData, hr=lubridate::hour(dtime)), year!=2016),
predOrder=c("Marine", "Humid", "Desert", "Cold"),
locMapper=locMapper_v05
)
## # A tibble: 80 x 4
## locale predicted n pct
## <chr> <fct> <int> <dbl>
## 1 Chicago, IL (2014) Cold 7928 0.915
## 2 Chicago, IL (2014) Desert 273 0.0315
## 3 Chicago, IL (2014) Humid 144 0.0166
## 4 Chicago, IL (2014) Marine 318 0.0367
## 5 Chicago, IL (2015) Cold 7722 0.886
## 6 Chicago, IL (2015) Desert 247 0.0283
## 7 Chicago, IL (2015) Humid 340 0.0390
## 8 Chicago, IL (2015) Marine 411 0.0471
## 9 Chicago, IL (2017) Cold 7660 0.878
## 10 Chicago, IL (2017) Desert 466 0.0534
## # ... with 70 more rows
Model performance on non-2016 data is not as strong, with roughly a 10%-20% loss of accuracy. Predictions are still much better than null accuracy, and the model (mostly) continues to separate the archetypes.
Suppose that models are run on all 2015-2018 data for Chicago, Las Vegas, New Orleans, and San Diego:
# Create the subset for Chicago, Las Vegas, New Orleans, San Diego
sub_2015_2018_data <- metarData %>%
filter(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"),
year %in% c(2015, 2016, 2017, 2018)
) %>%
mutate(city=str_replace(locale, pattern=" .\\d{4}.", replacement=""),
hr=lubridate::hour(dtime)
)
# Check that proper locales are included
sub_2015_2018_data %>%
count(city, locale)
## # A tibble: 16 x 3
## city locale n
## <chr> <chr> <int>
## 1 Chicago, IL Chicago, IL (2015) 8728
## 2 Chicago, IL Chicago, IL (2016) 8767
## 3 Chicago, IL Chicago, IL (2017) 8740
## 4 Chicago, IL Chicago, IL (2018) 8737
## 5 Las Vegas, NV Las Vegas, NV (2015) 8727
## 6 Las Vegas, NV Las Vegas, NV (2016) 8770
## 7 Las Vegas, NV Las Vegas, NV (2017) 8664
## 8 Las Vegas, NV Las Vegas, NV (2018) 8746
## 9 New Orleans, LA New Orleans, LA (2015) 8727
## 10 New Orleans, LA New Orleans, LA (2016) 8767
## 11 New Orleans, LA New Orleans, LA (2017) 8723
## 12 New Orleans, LA New Orleans, LA (2018) 8737
## 13 San Diego, CA San Diego, CA (2015) 8737
## 14 San Diego, CA San Diego, CA (2016) 8762
## 15 San Diego, CA San Diego, CA (2017) 8744
## 16 San Diego, CA San Diego, CA (2018) 8744
The random forest model is run and cached:
# Run random forest for 2015-2018 data
rf_types_2015_2018_TDmcwha <- rfMultiLocale(sub_2015_2018_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="city",
pred="city",
ntree=50,
seed=2006301420,
mtry=4
)
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
evalPredictions(rf_types_2015_2018_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="city"
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 16 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Chicago, IL Chicago, IL TRUE 9847 0.941
## 2 Chicago, IL Las Vegas, NV FALSE 121 0.0116
## 3 Chicago, IL New Orleans, LA FALSE 236 0.0226
## 4 Chicago, IL San Diego, CA FALSE 258 0.0247
## 5 Las Vegas, NV Chicago, IL FALSE 153 0.0146
## 6 Las Vegas, NV Las Vegas, NV TRUE 10048 0.961
## 7 Las Vegas, NV New Orleans, LA FALSE 60 0.00574
## 8 Las Vegas, NV San Diego, CA FALSE 192 0.0184
## 9 New Orleans, LA Chicago, IL FALSE 224 0.0213
## 10 New Orleans, LA Las Vegas, NV FALSE 98 0.00932
## 11 New Orleans, LA New Orleans, LA TRUE 9859 0.938
## 12 New Orleans, LA San Diego, CA FALSE 329 0.0313
## 13 San Diego, CA Chicago, IL FALSE 173 0.0166
## 14 San Diego, CA Las Vegas, NV FALSE 281 0.0269
## 15 San Diego, CA New Orleans, LA FALSE 197 0.0189
## 16 San Diego, CA San Diego, CA TRUE 9789 0.938
Even with a small forest (50 trees), the model is almost always separating Las Vegas, Chicago, San Diego, and New Orleans. While the climates are very different in these cities, it is striking that the model has so few misclassifications.
How do other cities map against these classifications?
# Predictions on 2015/2018 data
helperPredictPlot(rf_types_2015_2018_TDmcwha$rfModel,
df=filter(mutate(metarData, hr=lubridate::hour(dtime)),
!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))
),
predOrder=c("Chicago, IL", "San Diego, CA", "New Orleans, LA", "Las Vegas, NV")
)
## # A tibble: 104 x 4
## locale predicted n pct
## <chr> <fct> <int> <dbl>
## 1 Atlanta, GA (2016) Chicago, IL 3363 0.384
## 2 Atlanta, GA (2016) Las Vegas, NV 809 0.0924
## 3 Atlanta, GA (2016) New Orleans, LA 3709 0.424
## 4 Atlanta, GA (2016) San Diego, CA 870 0.0994
## 5 Boston, MA (2016) Chicago, IL 7710 0.891
## 6 Boston, MA (2016) Las Vegas, NV 430 0.0497
## 7 Boston, MA (2016) New Orleans, LA 291 0.0336
## 8 Boston, MA (2016) San Diego, CA 223 0.0258
## 9 Dallas, TX (2016) Chicago, IL 2999 0.343
## 10 Dallas, TX (2016) Las Vegas, NV 1147 0.131
## # ... with 94 more rows
Classifications are broadly as expected based on the previous archetype analysis. Variable importances are plotted:
helperPlotVarImp(rf_types_2015_2018_TDmcwha$rfModel)
Dew point and temperature by month continue to be strong factors for separating the four cities in this analysis. SLP, minimum cloud height, and prevailing wind direction are also meaningful.
An assessment can be run for the 2015-2018 model:
# Run for the full model including SLP
probs_2015_2018_TDmcwha <-
assessPredictionCertainty(rf_types_2015_2018_TDmcwha,
keyVar="city",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, SLP",
showAcc=TRUE
)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(keyVar)` instead of `keyVar` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(pkVars)` instead of `pkVars` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
A similar process can be run for assessing the classification of the other cities against the 2015-2017 data for Chicago, Las Vegas, New Orleans, and San Diego:
useData <- metarData %>%
filter(!(str_sub(source, 1, 4) %in% c("kord", "klas", "kmsy", "ksan"))) %>%
mutate(hr=lubridate::hour(dtime))
# Run for the model excluding SLP
probs_allcities_2015_2018_TDmcwh <-
assessPredictionCertainty(rf_types_2015_2018_TDmcwha,
testData=useData,
keyVar="locale",
plotCaption="Temp, Dew Point, Month/Hour, Clouds, Wind, modSLP",
showHists=TRUE
)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The model is frequently not so confident in assigning an archetype to related cities, though it frequently gets the most sensible assignment.
An assessment is run to look at the evolution of the model as it take in the ‘next best’ variable for building out the random forest:
# Define the variables to be considered
possVars <- c("TempF", "DewF", "month", "hr", "minHeight",
"ceilingHeight", "WindSpeed", "predomDir", "modSLP"
)
# Create a container for storing the best random forest and variable added from each run
rfContainer <- vector("list", length=length(possVars))
rfVarAdds <- vector("character", length=length(possVars))
# Run the functions using a for loop over the length of possVars
for (ctr in 1:length(possVars)) {
# Pull in the results of the previous run
if (ctr==1) {
prevVars <- c()
} else {
prevVars <- rfVarAdds[1:(ctr-1)]
}
# Run each of them through the combinations
tmpList <- helperRFCombinations(possVars, df=sub_2015_2018_data, prevVars=prevVars)
# Assess the performance
tmpAccuracy <- helperVariableAccuracy(tmpList, possVars=possVars, prevVars=prevVars)
# Prepare to repeat the process
bestRow <- tmpAccuracy %>%
filter(locale=="OOB") %>%
filter(accuracy==max(accuracy))
bestRow
# Update the rfContainer and rfVarAdds elements
rfContainer[[ctr]] <- tmpList[[as.integer(pull(bestRow, vrblNum))]]
rfVarAdds[ctr] <- pull(bestRow, vrblName)
cat("\nVariable Added:", rfVarAdds[ctr], "\n")
}
##
## Will run for: TempF DewF month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be:
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: DewF
##
## Will run for: TempF month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: TempF
##
## Will run for: month hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF TempF
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: month
##
## Will run for: hr minHeight ceilingHeight WindSpeed predomDir modSLP
## Fixed variable(s) will be: DewF TempF month
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: modSLP
##
## Will run for: hr minHeight ceilingHeight WindSpeed predomDir
## Fixed variable(s) will be: DewF TempF month modSLP
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: predomDir
##
## Will run for: hr minHeight ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: minHeight
##
## Will run for: hr ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: hr
##
## Will run for: ceilingHeight WindSpeed
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight hr
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: WindSpeed
##
## Will run for: ceilingHeight
## Fixed variable(s) will be: DewF TempF month modSLP predomDir minHeight hr WindSpeed
##
## Running for locations:
## [1] "Chicago, IL" "Las Vegas, NV" "New Orleans, LA" "San Diego, CA"
##
## Variable Added: ceilingHeight
The evolution of accuracy can then be plotted:
# Pull the accuracy data from the variables selected
tblAccuracy <- map_dfr(rfContainer, .f=helperExtractAccuracy, .id="vrblNum") %>%
mutate(vrblName=rfVarAdds[as.integer(vrblNum)],
locale=factor(locale, levels=c("OOB", unique(locale)[unique(locale) != "OOB"])),
plotLabel=ifelse(as.integer(vrblNum)==1, vrblName, paste0("add ", vrblName))
)
# Plot the gains in accuracy, facetted by 'locale'
ggplot(tblAccuracy, aes(x=fct_reorder(plotLabel, -as.integer(vrblNum)))) +
geom_text(aes(y=accuracy, label=paste0(round(100*accuracy), "%"))) +
facet_wrap(~locale, nrow=1) +
ylim(c(0, 1)) +
geom_hline(yintercept=0.25, lty=2) +
labs(x="",
y="",
title="Evolution of accuracy as next-best variables are added",
caption="Null accuracy is 25%"
) +
coord_flip()
As seen previously, dew point, temperature, and month are significant differentiating variables, driving roughly 75% accuracy in classifying the archetypes.
Adding altimeter (SLP) bumps the accuracy by another ~10%, while adding minimum cloud height and prevailing wind direction bump the accuracy by another ~5%.
This shows some of the power of the random forest algorithm as it is given additional variables to explore. Evolution of error can be plotted to see the impact:
oobError <- c()
for (ctr in 1:9) {
oobError <- c(oobError, rfContainer[[ctr]]$rfModel$err.rate[, "OOB"])
}
tibble::tibble(nVars=rep(1:9, each=25),
ntrees=rep(1:25, times=9),
accuracy=1-oobError
) %>%
ggplot(aes(x=ntrees, y=accuracy)) +
geom_line(aes(group=nVars, color=factor(nVars))) +
labs(x="Number of Trees",
y="OOB Accuracy",
title="Accuracy improves with more trees and more variables"
) +
scale_color_discrete("# Variables")
With a greater number of variables, there is a greater lift in accuracy as the forest grows larger.
Suppose that an attempt is made to classify the year for a single city. Example code includes:
# Create a subset of the data for only Chicago (2014-2019)
sub_kord_data <- metarData %>%
mutate(hr=lubridate::hour(dtime)) %>%
filter(str_sub(source, 1 ,4) %in% "kord")
# Run random forest for Chicago 2014-2019 data
rf_kord_TDmcwha <- rfMultiLocale(sub_kord_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="year",
pred="year",
ntree=200,
seed=2007011309,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_kord_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 2016 0.798
## 2 2014 2015 FALSE 136 0.0539
## 3 2014 2016 FALSE 85 0.0337
## 4 2014 2017 FALSE 101 0.04
## 5 2014 2018 FALSE 102 0.0404
## 6 2014 2019 FALSE 85 0.0337
## 7 2015 2014 FALSE 126 0.048
## 8 2015 2015 TRUE 2033 0.774
## 9 2015 2016 FALSE 145 0.0552
## 10 2015 2017 FALSE 103 0.0392
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_kord_TDmcwha$rfModel)
# Evaluate error evolution
errorEvolution(rf_kord_TDmcwha, useCategory="OOB", subT="Chicago (2014-2019)")
## # A tibble: 200 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.491
## 2 2 OOB 0.488
## 3 3 OOB 0.489
## 4 4 OOB 0.476
## 5 5 OOB 0.469
## 6 6 OOB 0.454
## 7 7 OOB 0.442
## 8 8 OOB 0.428
## 9 9 OOB 0.418
## 10 10 OOB 0.408
## # ... with 190 more rows
Interestingly, the model predicts the year with 75%-80% accuracy (null accuracy 17%), suggesting there is significant annual variation in Chicago climate. Sea-level-pressure has the largest variable importance (SLP is a mix of altitude, temperature, dew point, and high/low pressure systems). The first 50 trees significantly reduce the OOB error, with more modest, but still continuing, error reduction afterwards.
The process is run for Las Vegas:
# Create a subset of the data for only Las Vegas (2014-2019)
sub_klas_data <- metarData %>%
mutate(hr=lubridate::hour(dtime)) %>%
filter(str_sub(source, 1 ,4) %in% "klas")
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDmcwha <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="year",
pred="year",
ntree=200,
seed=2007011405,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDmcwha,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, SLP",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 1885 0.730
## 2 2014 2015 FALSE 175 0.0678
## 3 2014 2016 FALSE 142 0.0550
## 4 2014 2017 FALSE 149 0.0577
## 5 2014 2018 FALSE 109 0.0422
## 6 2014 2019 FALSE 121 0.0469
## 7 2015 2014 FALSE 197 0.0757
## 8 2015 2015 TRUE 1711 0.658
## 9 2015 2016 FALSE 162 0.0623
## 10 2015 2017 FALSE 169 0.0650
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDmcwha$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDmcwha, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 200 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.576
## 2 2 OOB 0.569
## 3 3 OOB 0.565
## 4 4 OOB 0.560
## 5 5 OOB 0.554
## 6 6 OOB 0.548
## 7 7 OOB 0.536
## 8 8 OOB 0.527
## 9 9 OOB 0.518
## 10 10 OOB 0.511
## # ... with 190 more rows
Prediction accuracies are lower for Las Vegas, averaging 60%-75% (2014 appears to be much more differentiated than the other years). This is suggestive that there is less annual variation in the Las Vegas climate than there is in the Chicago climate, though still enough to meaningfully pull apart the years.
Modified seal-level pressure is the top predictor, and the majority of the OOB error reduction occurs in the first 50 trees.
Suppose that modSLP is eliminated as a predictor variable, and the process is scaled down to 50 trees and repeated for Chicago:
# Run random forest for Chicago 2014-2019 data
rf_kord_TDmcwh_50 <- rfMultiLocale(sub_kord_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="year",
pred="year",
ntree=50,
seed=2007011410,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_kord_TDmcwh_50,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 1621 0.629
## 2 2014 2015 FALSE 194 0.0753
## 3 2014 2016 FALSE 175 0.0679
## 4 2014 2017 FALSE 217 0.0842
## 5 2014 2018 FALSE 197 0.0764
## 6 2014 2019 FALSE 174 0.0675
## 7 2015 2014 FALSE 240 0.0901
## 8 2015 2015 TRUE 1554 0.583
## 9 2015 2016 FALSE 238 0.0893
## 10 2015 2017 FALSE 196 0.0736
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_kord_TDmcwh_50$rfModel)
# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50,
withSLP=rf_kord_TDmcwha$rfModel$err.rate[1:50, "OOB"],
noSLP=rf_kord_TDmcwh_50$rfModel$err.rate[1:50, "OOB"]
) %>%
pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
ggplot(aes(x=ntree, y=Error, color=Model)) +
geom_line() +
geom_text(data=~filter(., ntree %in% c(1, 50)),
aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
) +
labs(x="# Trees",
y="OOB Error Rate",
title="OOB Error Evolution for SLP Included/Excluded"
) +
ylim(c(0, NA))
Modified sea-level pressure is revealed to be a significant driver of prediction accuracy for Chicago, confirming findings of the previous variable importance analysis. This suggests different patterns of high and low pressure being in control by year may meaningfully improve the ability to predict Chicago by year.
The process can also be run for Las Vegas:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDmcwh_50 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="year",
pred="year",
ntree=50,
seed=2007011420,
mtry=4
)
##
## Running for locations:
## [1] 2014 2015 2016 2017 2018 2019
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDmcwh_50,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind",
keyVar="year",
locOrder=TRUE
)
## # A tibble: 36 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 2014 2014 TRUE 1610 0.630
## 2 2014 2015 FALSE 258 0.101
## 3 2014 2016 FALSE 177 0.0693
## 4 2014 2017 FALSE 194 0.0759
## 5 2014 2018 FALSE 148 0.0579
## 6 2014 2019 FALSE 168 0.0658
## 7 2015 2014 FALSE 298 0.115
## 8 2015 2015 TRUE 1338 0.516
## 9 2015 2016 FALSE 228 0.0880
## 10 2015 2017 FALSE 227 0.0876
## # ... with 26 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDmcwh_50$rfModel)
# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50,
withSLP=rf_klas_TDmcwha$rfModel$err.rate[1:50, "OOB"],
noSLP=rf_klas_TDmcwh_50$rfModel$err.rate[1:50, "OOB"]
) %>%
pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
ggplot(aes(x=ntree, y=Error, color=Model)) +
geom_line() +
geom_text(data=~filter(., ntree %in% c(1, 50)),
aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
) +
labs(x="# Trees",
y="OOB Error Rate",
title="OOB Error Evolution for SLP Included/Excluded"
) +
ylim(c(0, NA))
Modified sea-level pressure is revealed as a meaningful driver of prediction accuracy for Las Vegas also, though with a lesser effect than seen in Chicago. This potentially suggests less variability by year in SLP for Las Vegas.
Since SLP is so meaningful, are there any patterns by year?
sub_kord_data %>%
bind_rows(sub_klas_data, .id="cityFile") %>%
select(cityFile, year, month, modSLP) %>%
mutate(city=case_when(cityFile==1 ~ "Chicago", cityFile==2 ~ "Las Vegas", TRUE ~ "Error")) %>%
filter(!is.na(modSLP)) %>%
ggplot(aes(x=factor(year), y=modSLP)) +
geom_boxplot(fill="lightblue") +
facet_wrap(~city) +
labs(x="", y="modSLP", title="Modified Sea-Level Pressure by Year (Chicago and Las Vegas)")
Chicago has a higher sea-level pressure than Las Vegas (as expected), though neither city shows much variation on average from year to year. The Chicago data are explored further by month:
sub_kord_data %>%
select(year, month, modSLP) %>%
filter(!is.na(modSLP)) %>%
ggplot(aes(x=month, y=modSLP)) +
geom_boxplot(fill="lightblue") +
facet_wrap(~year) +
labs(x="", y="modSLP", title="Modified Sea-Level Pressure by Month and Year (Chicago)")
sub_kord_data %>%
select(year, month, modSLP) %>%
filter(!is.na(modSLP)) %>%
group_by(year, month) %>%
summarize(medSLP=median(modSLP)) %>%
ggplot(aes(x=month, y=medSLP, color=factor(year), group=year)) +
geom_line(lwd=1) +
labs(x="", y="Median modSLP", title="Median modified Sea-Level Pressure by Month and Year (Chicago)")
sub_kord_data %>%
select(year, month, modSLP) %>%
filter(!is.na(modSLP)) %>%
group_by(year, month) %>%
summarize(medSLP=median(modSLP)) %>%
group_by(month) %>%
summarize(sdSLP=sd(medSLP)) %>%
ggplot(aes(x=month, y=sdSLP)) +
geom_point() +
labs(x="",
y="Standard Deviation of Median modSLP by Year",
title="Variability by year of modified Sea-Level Pressure by Month (Chicago)"
)
There are meaningful differences in median modified sea-level pressure by month by year. This is suggestive that several year of data are needed to define an archetype, as there is risk that otherwise the archetype will be an anomalous “high pressure was in contol in June” when that is not generally applicable. And, since high vs. low pressure are often related to temperature, dew point, coludiness, wind speed, and the like, the need for multiple years of data to define an archetype likely extends to all the other variables as well.
With greater variability in SLP by year in the cold-weather months, predictive ability in the cold-weather months is expected to be higher:
rf_kord_TDmcwha$testData %>%
group_by(month) %>%
summarize(pctError=1-mean(correct)) %>%
ggplot(aes(x=month, y=pctError)) +
geom_col(fill="lightblue") +
geom_text(aes(y=pctError/2, label=paste0(round(100*pctError, 1), "%"))) +
labs(x="", y="Error Rate", title="Prediction Error Rate by Month (Chicago)")
As expected, the model takes advantage of greater annual variation in modSLP during the cold season to make better prediction of which year it is during the cold season.
Next an attempt is made to predict the month given the data:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDycwha_month_50 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"year", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="month",
pred="month",
ntree=50,
seed=2007021356,
mtry=4
)
##
## Running for locations:
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDycwha_month_50,
plotCaption = "Temp, Dew Point, Year, Hour of Day, Cloud Height, Wind, Altimeter",
keyVar="month",
locOrder=TRUE
)
## # A tibble: 105 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Jan Jan TRUE 1087 0.809
## 2 Jan Feb FALSE 50 0.0372
## 3 Jan Mar FALSE 27 0.0201
## 4 Jan Apr FALSE 6 0.00447
## 5 Jan May FALSE 5 0.00372
## 6 Jan Nov FALSE 50 0.0372
## 7 Jan Dec FALSE 118 0.0879
## 8 Feb Jan FALSE 75 0.0626
## 9 Feb Feb TRUE 820 0.684
## 10 Feb Mar FALSE 105 0.0876
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDycwha_month_50$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDycwha_month_50, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 50 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.536
## 2 2 OOB 0.533
## 3 3 OOB 0.528
## 4 4 OOB 0.521
## 5 5 OOB 0.514
## 6 6 OOB 0.501
## 7 7 OOB 0.494
## 8 8 OOB 0.481
## 9 9 OOB 0.472
## 10 10 OOB 0.462
## # ... with 40 more rows
There is meaningful seasonality to the Las Vegas data such that the predicted month is frequently either the same season or (for spring/fall) the mirror season.
Suppose that year is not available as a predictor:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDcwha_month_50 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="month",
pred="month",
ntree=50,
seed=2007021402,
mtry=4
)
##
## Running for locations:
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDcwha_month_50,
plotCaption = "Temp, Dew Point, Hour of Day, Cloud Height, Wind, Altimeter",
keyVar="month",
locOrder=TRUE
)
## # A tibble: 105 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Jan Jan TRUE 920 0.697
## 2 Jan Feb FALSE 85 0.0644
## 3 Jan Mar FALSE 33 0.025
## 4 Jan Apr FALSE 6 0.00455
## 5 Jan May FALSE 5 0.00379
## 6 Jan Nov FALSE 70 0.0530
## 7 Jan Dec FALSE 201 0.152
## 8 Feb Jan FALSE 112 0.0945
## 9 Feb Feb TRUE 612 0.516
## 10 Feb Mar FALSE 139 0.117
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDcwha_month_50$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDcwha_month_50, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 50 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.618
## 2 2 OOB 0.616
## 3 3 OOB 0.611
## 4 4 OOB 0.605
## 5 5 OOB 0.599
## 6 6 OOB 0.593
## 7 7 OOB 0.587
## 8 8 OOB 0.580
## 9 9 OOB 0.575
## 10 10 OOB 0.565
## # ... with 40 more rows
The same seasonal pattern is observed, though prediction accuracy falls be ~15%. This is suggestive that it is not uncommon for Las Vegas climate to run a month-ahead or a month-behind where it ran in a previous year.
The model appears to still be learning at 50 trees, so it is expanded to 150 trees:
# Run random forest for Las Vegas 2014-2019 data
rf_klas_TDcwha_month_150 <- rfMultiLocale(sub_klas_data,
vrbls=c("TempF", "DewF",
"hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="month",
pred="month",
ntree=150,
seed=2007021402,
mtry=4
)
##
## Running for locations:
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_klas_TDcwha_month_150,
plotCaption = "Temp, Dew Point, Hour of Day, Cloud Height, Wind, Altimeter",
keyVar="month",
locOrder=TRUE
)
## # A tibble: 105 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Jan Jan TRUE 927 0.702
## 2 Jan Feb FALSE 75 0.0568
## 3 Jan Mar FALSE 38 0.0288
## 4 Jan Apr FALSE 6 0.00455
## 5 Jan May FALSE 5 0.00379
## 6 Jan Nov FALSE 69 0.0523
## 7 Jan Dec FALSE 200 0.152
## 8 Feb Jan FALSE 91 0.0768
## 9 Feb Feb TRUE 623 0.526
## 10 Feb Mar FALSE 131 0.111
## # ... with 95 more rows
# Evaluate variable importance
helperPlotVarImp(rf_klas_TDcwha_month_150$rfModel)
# Evaluate error evolution
errorEvolution(rf_klas_TDcwha_month_150, useCategory="OOB", subT="Las Vegas (2014-2019)")
## # A tibble: 150 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.618
## 2 2 OOB 0.616
## 3 3 OOB 0.611
## 4 4 OOB 0.605
## 5 5 OOB 0.599
## 6 6 OOB 0.593
## 7 7 OOB 0.587
## 8 8 OOB 0.580
## 9 9 OOB 0.575
## 10 10 OOB 0.565
## # ... with 140 more rows
Another topic of interest is the use of sea-level pressure or altimeter in modeling. The altimeter setting in a METAR is captured in two ways:
In theory, both metrics report almost the same thing but in different units. There is a small difference in that SLP makes an adjustment for average 12-hour temperature (since temperature is a driver of air pressure) while altimeter is purely an adjustment for altitude.
How does the model perform with neither/either/both or Altimeter and modSLP included? An example will be used for trying to classify a handful of cities from 2016:
sub_mini_2016_data <- metarData %>%
mutate(hr=lubridate::hour(dtime)) %>%
filter(year==2016,
str_sub(source, 1 ,4) %in% c("kord", "kmke", "kmsy", "klas", "ksan", "kden", "kbos", "ksea")
)
First, a model is run with neither Altimeter nor SLP:
# Run random forest for subset data (2016)
rf_mini_TDmcwh <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=100,
seed=2007021422,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
Prediction accuracy and variable importance can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_mini_TDmcwh,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind",
keyVar="locale",
locOrder=NULL
)
## # A tibble: 64 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Boston, MA (2016) Boston, MA (2016) TRUE 1800 0.695
## 2 Boston, MA (2016) Chicago, IL (2016) FALSE 186 0.0718
## 3 Boston, MA (2016) Denver, CO (2016) FALSE 158 0.0610
## 4 Boston, MA (2016) Las Vegas, NV (2016) FALSE 25 0.00965
## 5 Boston, MA (2016) Milwaukee, WI (2016) FALSE 204 0.0788
## 6 Boston, MA (2016) New Orleans, LA (2016) FALSE 39 0.0151
## 7 Boston, MA (2016) San Diego, CA (2016) FALSE 41 0.0158
## 8 Boston, MA (2016) Seattle, WA (2016) FALSE 137 0.0529
## 9 Chicago, IL (2016) Boston, MA (2016) FALSE 167 0.0641
## 10 Chicago, IL (2016) Chicago, IL (2016) TRUE 1538 0.590
## # ... with 54 more rows
# Evaluate variable importance
helperPlotVarImp(rf_mini_TDmcwh$rfModel)
# Evaluate error evolution
errorEvolution(rf_mini_TDmcwh, useCategory="OOB", subT="No Altimeter, No SLP")
## # A tibble: 100 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.374
## 2 2 OOB 0.373
## 3 3 OOB 0.369
## 4 4 OOB 0.363
## 5 5 OOB 0.355
## 6 6 OOB 0.345
## 7 7 OOB 0.335
## 8 8 OOB 0.328
## 9 9 OOB 0.317
## 10 10 OOB 0.309
## # ... with 90 more rows
Overall accuracy is roughly 75%, with the greatest challenge being separating Chicago and Milwaukee, with Boston also showing some similarities to these cities.
The process is then run with both Altimeter and SLP:
# Run random forest for subset data (2016)
rf_mini_TDmcwhas <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=100,
seed=2007021429,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
Prediction accuracy and variable importance can again be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_mini_TDmcwhas,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locale",
locOrder=NULL
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 62 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Boston, MA (2016) Boston, MA (2016) TRUE 2131 0.837
## 2 Boston, MA (2016) Chicago, IL (2016) FALSE 90 0.0354
## 3 Boston, MA (2016) Denver, CO (2016) FALSE 44 0.0173
## 4 Boston, MA (2016) Las Vegas, NV (2016) FALSE 25 0.00982
## 5 Boston, MA (2016) Milwaukee, WI (2016) FALSE 109 0.0428
## 6 Boston, MA (2016) New Orleans, LA (2016) FALSE 18 0.00707
## 7 Boston, MA (2016) San Diego, CA (2016) FALSE 30 0.0118
## 8 Boston, MA (2016) Seattle, WA (2016) FALSE 98 0.0385
## 9 Chicago, IL (2016) Boston, MA (2016) FALSE 112 0.0412
## 10 Chicago, IL (2016) Chicago, IL (2016) TRUE 1852 0.681
## # ... with 52 more rows
# Evaluate variable importance
helperPlotVarImp(rf_mini_TDmcwhas$rfModel)
# Evaluate error evolution
errorEvolution(rf_mini_TDmcwhas, useCategory="OOB", subT="Both Altimeter and SLP")
## # A tibble: 100 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.316
## 2 2 OOB 0.319
## 3 3 OOB 0.319
## 4 4 OOB 0.310
## 5 5 OOB 0.299
## 6 6 OOB 0.291
## 7 7 OOB 0.282
## 8 8 OOB 0.271
## 9 9 OOB 0.261
## 10 10 OOB 0.251
## # ... with 90 more rows
Overall accuracy increases to about 85%. Except for Chicago/Milwaukee, every city is overwhelmingly classified as itself.
Smaller runs with just SLP and just Altimeter are conducted, to verify that the learning rate is roughly the same in these cases. As well SLP plus a dummy variable (SLP + rnorm()) is run:
# Run random forest for subset data (2016)
rf_mini_TDmcwhs <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=50,
seed=2007021435,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
# Run random forest for subset data (2016)
rf_mini_TDmcwha <- rfMultiLocale(sub_mini_2016_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"Altimeter"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=50,
seed=2007021440,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
set.seed(2007021444)
subUseData <- sub_mini_2016_data %>%
mutate(dummy=modSLP + rnorm(n=nrow(sub_mini_2016_data)))
# Run random forest for subset data (2016)
rf_mini_TDmcwhsd <- rfMultiLocale(subUseData,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "dummy"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=50,
seed=2007021445,
mtry=4
)
##
## Running for locations:
## [1] "Boston, MA (2016)" "Chicago, IL (2016)" "Denver, CO (2016)"
## [4] "Las Vegas, NV (2016)" "Milwaukee, WI (2016)" "New Orleans, LA (2016)"
## [7] "San Diego, CA (2016)" "Seattle, WA (2016)"
OOB error evolution through the first 50 trees can be compared:
# Compare error evolution through the first 50 trees
tibble::tibble(ntree=1:50,
neither=rf_mini_TDmcwh$rfModel$err.rate[1:50, "OOB"],
altimeter=rf_mini_TDmcwha$rfModel$err.rate[1:50, "OOB"],
modslp=rf_mini_TDmcwhs$rfModel$err.rate[1:50, "OOB"],
slp_dummy=rf_mini_TDmcwhsd$rfModel$err.rate[1:50, "OOB"],
both=rf_mini_TDmcwhas$rfModel$err.rate[1:50, "OOB"]
) %>%
pivot_longer(-ntree, names_to="Model", values_to="Error") %>%
ggplot(aes(x=ntree, y=Error, color=Model)) +
geom_line() +
geom_text(data=~filter(., ntree %in% c(1, 50)),
aes(y=Error+0.02, label=paste0(round(100*Error, 1), "%"))
) +
labs(x="# Trees",
y="OOB Error Rate",
title="OOB Error Evolution for SLP Included/Excluded"
) +
ylim(c(0, NA))
Interestingly, the inclusion of both Altimeter and modSLP appears to drive a 2%-3% decrease in classification error. The main difference between the variables is that modSLP contains a hint about the previous 12-hour temperature.
Importantly, including a dummy variable with modSLP has either no impact or a slight negative impact on model performance. So, the addition of Altimeter to modSLP is likely a real impact, not merely the inclusion of two highly correlated variables that are both valuable to driving model performance.
Next, an attempt is made to compare every grouping of two cities, using all variables, mtry of 4, and a very small forest of 15 trees:
# Create a container list to hold the output
list_varimp_2016 <- vector("list", 0.5*length(names_2016)*(length(names_2016)-1))
# Set a random seed
set.seed(2007031342)
# Loop through all possible combinations
n <- 1
for (ctr in 1:(length(names_2016)-1)) {
for (ctr2 in (ctr+1):length(names_2016)) {
list_varimp_2016[[n]] <- rfTwoLocales(mutate(metarData, hr=lubridate::hour(dtime)),
loc1=names_2016[ctr],
loc2=names_2016[ctr2],
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
ntree=15,
mtry=4
)
n <- n + 1
if ((n %% 40) == 0) { cat("Through number:", n, "\n")}
}
}
## Through number: 40
## Through number: 80
## Through number: 120
## Through number: 160
## Through number: 200
## Through number: 240
## Through number: 280
## Through number: 320
## Through number: 360
## Through number: 400
# Create a tibble from the underlying accuracy data
acc_varimp_2016 <- map_dfr(list_varimp_2016, .f=helperAccuracyLocale)
# Assess the top 20 classification accuracies
acc_varimp_2016 %>%
arrange(-accOverall) %>%
head(20)
## # A tibble: 20 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Denver, CO (2016) Miami, FL (2016) 0.998 0.998 0.998
## 2 Denver, CO (2016) Tampa Bay, FL (2016) 0.996 0.998 0.995
## 3 Las Vegas, NV (2016) Miami, FL (2016) 0.995 0.995 0.995
## 4 Denver, CO (2016) New Orleans, LA (201~ 0.995 0.996 0.994
## 5 Denver, CO (2016) San Diego, CA (2016) 0.994 0.994 0.994
## 6 Denver, CO (2016) Houston, TX (2016) 0.993 0.994 0.992
## 7 Denver, CO (2016) San Francisco, CA (2~ 0.993 0.994 0.992
## 8 Miami, FL (2016) Seattle, WA (2016) 0.992 0.990 0.994
## 9 Miami, FL (2016) Phoenix, AZ (2016) 0.992 0.991 0.992
## 10 Miami, FL (2016) Traverse City, MI (2~ 0.991 0.993 0.990
## 11 Miami, FL (2016) Minneapolis, MN (201~ 0.991 0.991 0.990
## 12 Boston, MA (2016) Miami, FL (2016) 0.991 0.990 0.991
## 13 Denver, CO (2016) Los Angeles, CA (201~ 0.991 0.993 0.989
## 14 Denver, CO (2016) San Jose, CA (2016) 0.990 0.991 0.989
## 15 Denver, CO (2016) San Antonio, TX (201~ 0.990 0.992 0.988
## 16 Grand Rapids, MI (201~ Miami, FL (2016) 0.990 0.990 0.990
## 17 Green Bay, WI (2016) Miami, FL (2016) 0.990 0.989 0.990
## 18 Miami, FL (2016) Milwaukee, WI (2016) 0.989 0.989 0.989
## 19 Madison, WI (2016) Miami, FL (2016) 0.989 0.986 0.992
## 20 Las Vegas, NV (2016) Tampa Bay, FL (2016) 0.989 0.990 0.987
# Assess the bottom 20 classification accuracies
acc_varimp_2016 %>%
arrange(accOverall) %>%
head(20)
## # A tibble: 20 x 5
## locale1 locale2 accOverall accLocale1 accLocale2
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Chicago, IL (2016) Milwaukee, WI (2016) 0.675 0.685 0.666
## 2 Newark, NJ (2016) Philadelphia, PA (20~ 0.676 0.687 0.665
## 3 Detroit, MI (2016) Grand Rapids, MI (20~ 0.712 0.718 0.706
## 4 Madison, WI (2016) Milwaukee, WI (2016) 0.718 0.725 0.711
## 5 Philadelphia, PA (201~ Washington, DC (2016) 0.730 0.741 0.718
## 6 Chicago, IL (2016) Grand Rapids, MI (20~ 0.731 0.750 0.711
## 7 Green Bay, WI (2016) Madison, WI (2016) 0.737 0.747 0.728
## 8 Chicago, IL (2016) Detroit, MI (2016) 0.740 0.746 0.735
## 9 Chicago, IL (2016) Madison, WI (2016) 0.745 0.756 0.733
## 10 Grand Rapids, MI (201~ Milwaukee, WI (2016) 0.746 0.747 0.744
## 11 Green Bay, WI (2016) Milwaukee, WI (2016) 0.754 0.757 0.751
## 12 Madison, WI (2016) Minneapolis, MN (201~ 0.758 0.759 0.757
## 13 Grand Rapids, MI (201~ Madison, WI (2016) 0.759 0.763 0.755
## 14 Detroit, MI (2016) Milwaukee, WI (2016) 0.765 0.777 0.753
## 15 Grand Rapids, MI (201~ Traverse City, MI (2~ 0.769 0.770 0.768
## 16 Detroit, MI (2016) Indianapolis, IN (20~ 0.772 0.775 0.770
## 17 Chicago, IL (2016) Indianapolis, IN (20~ 0.775 0.777 0.774
## 18 Boston, MA (2016) Newark, NJ (2016) 0.778 0.781 0.776
## 19 Indianapolis, IN (201~ Saint Louis, MO (201~ 0.779 0.791 0.768
## 20 Madison, WI (2016) Traverse City, MI (2~ 0.783 0.782 0.784
As pre previous, the best accuracies are obtained when comparing cities in very different climates (e.g., Denver vs. Humid/Marine or Miami vs. Desert/Cold), while the worst accuracies are obtained when comparing very similar cities (e.g., Chicago and Milwaukee or Newar and Philadelphia).
Variable importance can then be assessed across all 1:1 classifications:
# Create a tibble of all the variable importance data
val_varimp_2016 <- map_dfr(list_varimp_2016,
.f=function(x) { x$rfModel %>%
caret::varImp() %>%
t() %>%
as.data.frame()
}
) %>%
tibble::as_tibble()
# Create boxplot of overall variable importance
val_varimp_2016 %>%
mutate(num=1:nrow(val_varimp_2016)) %>%
pivot_longer(-num, names_to="variable", values_to="varImp") %>%
ggplot(aes(x=fct_reorder(variable, varImp), y=varImp)) +
geom_boxplot(fill="lightblue") +
labs(x="",
y="Variable Importance",
title="Variable Importance for 1:1 Random Forest Classifications"
)
# Attach the city names and OOB error rate
tbl_varimp_2016 <- sapply(list_varimp_2016,
FUN=function(x) { c(names(x$errorRate[2:3]), x$errorRate["OOB"]) }
) %>%
t() %>%
as.data.frame() %>%
bind_cols(val_varimp_2016) %>%
tibble::as_tibble() %>%
mutate(OOB=as.numeric(as.character(OOB))) %>%
rename(locale1=V1,
locale2=V2
)
# Plot accuracy vs. spikiness of variable importance
tbl_varimp_2016 %>%
pivot_longer(-c(locale1, locale2, OOB), names_to="var", values_to="varImp") %>%
group_by(locale1, locale2, OOB) %>%
summarize(mean=mean(varImp), max=max(varImp)) %>%
mutate(maxMean=max/mean) %>%
ggplot(aes(x=maxMean, y=1-OOB)) +
geom_point() +
geom_smooth(method="loess") +
labs(x="Ratio of Maximum Variable Importance to Mean Variable Importance",
y="OOB Accuracy",
title="Accuracy vs. Spikiness of Variable Importance"
)
Broadly speaking, the same variables that drive overall classification are important in driving 1:1 classifications. There is meaningful spikiness, suggesting that different 1:1 classifications rely on different variables.
There is a strong trend where the best accuracies are obtained where there is a single spiky dimension that drives the classifications. This suggests that while the model can take advantage of all 10 variables, it has the easiest tome when there is a single, well-differentiated variable. No surprise.
An attempt is made to define archetype cities as cities that are either 1) very different from other cities, or 2) very similar to other cities. An archetype should always meet criteria for some meaningful examples, and criteria 2 allows for an archetype to be similar to several other cities of the same archetype:
oobData <- tibble::tibble(City=character(0), OOBLow=numeric(0), OOBHigh=numeric(0))
for (city in cityNameMapper[names_2016]) {
cityData <- tbl_varimp_2016 %>%
filter(locale1 == city | locale2 == city)
lowOOB <- cityData %>%
top_n(n=10, wt=-OOB) %>%
pull(OOB) %>%
mean()
highOOB <- cityData %>%
top_n(n=5, wt=OOB) %>%
pull(OOB) %>%
mean()
cat("\nCity:", city,
"\tOOB (Low 10 Average):", round(lowOOB, 4),
"\tOOB (High 5 Average):", round(highOOB, 4)
)
oobData <- bind_rows(oobData, tibble::tibble(City=city, OOBLow=lowOOB, OOBHigh=highOOB))
}
##
## City: Atlanta, GA (2016) OOB (Low 10 Average): 0.0435 OOB (High 5 Average): 0.1349
## City: Boston, MA (2016) OOB (Low 10 Average): 0.0257 OOB (High 5 Average): 0.1907
## City: Washington, DC (2016) OOB (Low 10 Average): 0.0398 OOB (High 5 Average): 0.1903
## City: Denver, CO (2016) OOB (Low 10 Average): 0.0076 OOB (High 5 Average): 0.0609
## City: Dallas, TX (2016) OOB (Low 10 Average): 0.0409 OOB (High 5 Average): 0.1416
## City: Detroit, MI (2016) OOB (Low 10 Average): 0.0273 OOB (High 5 Average): 0.2446
## City: Newark, NJ (2016) OOB (Low 10 Average): 0.0366 OOB (High 5 Average): 0.2226
## City: Green Bay, WI (2016) OOB (Low 10 Average): 0.0202 OOB (High 5 Average): 0.2269
## City: Grand Rapids, MI (2016) OOB (Low 10 Average): 0.0238 OOB (High 5 Average): 0.2567
## City: Houston, TX (2016) OOB (Low 10 Average): 0.0232 OOB (High 5 Average): 0.1403
## City: Indianapolis, IN (2016) OOB (Low 10 Average): 0.0356 OOB (High 5 Average): 0.2127
## City: Las Vegas, NV (2016) OOB (Low 10 Average): 0.0162 OOB (High 5 Average): 0.0595
## City: Los Angeles, CA (2016) OOB (Low 10 Average): 0.0295 OOB (High 5 Average): 0.0943
## City: Lincoln, NE (2016) OOB (Low 10 Average): 0.0303 OOB (High 5 Average): 0.1682
## City: Miami, FL (2016) OOB (Low 10 Average): 0.0082 OOB (High 5 Average): 0.1027
## City: Milwaukee, WI (2016) OOB (Low 10 Average): 0.0228 OOB (High 5 Average): 0.2685
## City: Madison, WI (2016) OOB (Low 10 Average): 0.0243 OOB (High 5 Average): 0.2567
## City: Minneapolis, MN (2016) OOB (Low 10 Average): 0.022 OOB (High 5 Average): 0.205
## City: New Orleans, LA (2016) OOB (Low 10 Average): 0.0164 OOB (High 5 Average): 0.1368
## City: Chicago, IL (2016) OOB (Low 10 Average): 0.0275 OOB (High 5 Average): 0.2667
## City: Philadelphia, PA (2016) OOB (Low 10 Average): 0.0398 OOB (High 5 Average): 0.2211
## City: Phoenix, AZ (2016) OOB (Low 10 Average): 0.0158 OOB (High 5 Average): 0.0647
## City: San Diego, CA (2016) OOB (Low 10 Average): 0.0226 OOB (High 5 Average): 0.0843
## City: San Antonio, TX (2016) OOB (Low 10 Average): 0.0265 OOB (High 5 Average): 0.1022
## City: Seattle, WA (2016) OOB (Low 10 Average): 0.021 OOB (High 5 Average): 0.0756
## City: San Francisco, CA (2016) OOB (Low 10 Average): 0.0244 OOB (High 5 Average): 0.0895
## City: San Jose, CA (2016) OOB (Low 10 Average): 0.0334 OOB (High 5 Average): 0.1003
## City: Saint Louis, MO (2016) OOB (Low 10 Average): 0.0396 OOB (High 5 Average): 0.1737
## City: Tampa Bay, FL (2016) OOB (Low 10 Average): 0.0136 OOB (High 5 Average): 0.1332
## City: Traverse City, MI (2016) OOB (Low 10 Average): 0.0221 OOB (High 5 Average): 0.2158
# Plot OOB data summary
oobData %>%
ggplot(aes(x=OOBLow, y=OOBHigh)) +
geom_text(aes(label=City)) +
labs(x="Average OOB Error of Best 10 1:1 Predictions",
y="Average OOB Error of Worst 5 1:1 Predictions"
)
At a glance, Miami, Denver and Phoenix/Las Vegas make for great archetypes in this data. They are broadly dissimilar from a large number of cities and not all that similar to many other cities. New Orleans or Tampa may also make for good archetypes (somewhat surprisingly in that these should be similar to Miami).
Suppose that hierarchical clustering is attempted, using 1-OOB as the main ‘distance’ variable:
# Create both halves of what will become the distance matric
upperHalf <- tbl_varimp_2016 %>%
mutate(locale1=as.character(locale1), locale2=as.character(locale2))
lowerHalf <- upperHalf %>%
rename(locale2=locale1, locale1=locale2) %>%
select(locale1, locale2, everything())
bothHalf <- upperHalf %>%
bind_rows(lowerHalf)
# Select only locale1, locale2, OOB and create a record for each locale to itself
mtxData <- bothHalf %>%
select(locale1, locale2, OOB) %>%
bind_rows(tibble::tibble(locale1=unique(bothHalf$locale1), locale2=locale1, OOB=NA)) %>%
arrange(locale1, locale2) %>%
mutate(accuracy=1-OOB)
# Create the matrix and add row/column names
mtxOOB <- matrix(data=mtxData$accuracy, nrow=sqrt(nrow(mtxData)), ncol=sqrt(nrow(mtxData)), byrow=TRUE)
dimnames(mtxOOB) <- list(unique(mtxData$locale1), unique(mtxData$locale1))
# Convert to distance matrix
distOOB <- as.dist(mtxOOB)
# Run hierarchical clustering
distOOB %>%
hclust(method="complete") %>%
plot()
Seattle and Denver appear to be good stand-alone cities. They are not particularly close to any other cities.
Suppose that an attempt is made to classify cities as Seattle, Denver, or Other:
# Create data for Denver/Seattle/Other
sub_densea_data <- metarData %>%
mutate(place=case_when(source=="kden_2016" ~ "Denver",
source=="ksea_2016" ~ "Seattle",
TRUE ~ "Other"
)
) %>%
filter(year==2016)
# Down-sample all to smallest 'place'
nSmallPlace <- sub_densea_data %>%
count(place) %>%
pull(n) %>%
min()
# Group the data and sample to only this minimum size
set.seed(2007031504)
sub_densea_data <- sub_densea_data %>%
group_by(place) %>%
sample_n(size=nSmallPlace) %>%
ungroup() %>%
mutate(hr=lubridate::hour(dtime))
# Run random forest for subset data
rf_densea_TDmcwhsa <- rfMultiLocale(sub_densea_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="place",
pred="place",
ntree=100,
seed=2007031510,
mtry=4
)
##
## Running for locations:
## [1] "Denver" "Other" "Seattle"
# Evaluate prediction accuracy
evalPredictions(rf_densea_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="place",
locOrder=NULL
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 9 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Denver Denver TRUE 2522 0.960
## 2 Denver Other FALSE 83 0.0316
## 3 Denver Seattle FALSE 21 0.00800
## 4 Other Denver FALSE 180 0.0685
## 5 Other Other TRUE 2182 0.831
## 6 Other Seattle FALSE 265 0.101
## 7 Seattle Denver FALSE 22 0.00843
## 8 Seattle Other FALSE 115 0.0441
## 9 Seattle Seattle TRUE 2472 0.947
# Evaluate variable importance
helperPlotVarImp(rf_densea_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_densea_TDmcwhsa, subT="Denver, Seattle, Other")
## # A tibble: 400 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.180
## 2 1 Denver 0.115
## 3 1 Other 0.264
## 4 1 Seattle 0.162
## 5 2 OOB 0.191
## 6 2 Denver 0.135
## 7 2 Other 0.268
## 8 2 Seattle 0.169
## 9 3 OOB 0.189
## 10 3 Denver 0.122
## # ... with 390 more rows
The model is very good at picking out Seattle and Denver, but not so good at picking out that Other is not Seattle or Denver. A different approach may be needed to find stand-alone cities and archetypes.
An initial attempt is made to classify all of the 2016 cities, using a small forest of 25 trees, with the objective of finding cities that are frequently classified as each other:
# Run random forest for subset data
rf_all_nt25_TDmcwhsa <- rfMultiLocale(filter(mutate(metarData, hr=lubridate::hour(dtime)), year==2016),
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locale",
pred="locale",
ntree=25,
seed=2007041352,
mtry=4
)
##
## Running for locations:
## [1] "Atlanta, GA (2016)" "Boston, MA (2016)"
## [3] "Chicago, IL (2016)" "Dallas, TX (2016)"
## [5] "Denver, CO (2016)" "Detroit, MI (2016)"
## [7] "Grand Rapids, MI (2016)" "Green Bay, WI (2016)"
## [9] "Houston, TX (2016)" "Indianapolis, IN (2016)"
## [11] "Las Vegas, NV (2016)" "Lincoln, NE (2016)"
## [13] "Los Angeles, CA (2016)" "Madison, WI (2016)"
## [15] "Miami, FL (2016)" "Milwaukee, WI (2016)"
## [17] "Minneapolis, MN (2016)" "New Orleans, LA (2016)"
## [19] "Newark, NJ (2016)" "Philadelphia, PA (2016)"
## [21] "Phoenix, AZ (2016)" "Saint Louis, MO (2016)"
## [23] "San Antonio, TX (2016)" "San Diego, CA (2016)"
## [25] "San Francisco, CA (2016)" "San Jose, CA (2016)"
## [27] "Seattle, WA (2016)" "Tampa Bay, FL (2016)"
## [29] "Traverse City, MI (2016)" "Washington, DC (2016)"
Accuracy can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_all_nt25_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locale",
locOrder=NULL
)
## # A tibble: 863 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1776 0.663
## 2 Atlanta, GA (2016) Boston, MA (2016) FALSE 12 0.00448
## 3 Atlanta, GA (2016) Chicago, IL (2016) FALSE 12 0.00448
## 4 Atlanta, GA (2016) Dallas, TX (2016) FALSE 91 0.0340
## 5 Atlanta, GA (2016) Denver, CO (2016) FALSE 10 0.00373
## 6 Atlanta, GA (2016) Detroit, MI (2016) FALSE 20 0.00747
## 7 Atlanta, GA (2016) Grand Rapids, MI (2016) FALSE 9 0.00336
## 8 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 8 0.00299
## 9 Atlanta, GA (2016) Houston, TX (2016) FALSE 43 0.0161
## 10 Atlanta, GA (2016) Indianapolis, IN (2016) FALSE 44 0.0164
## # ... with 853 more rows
# Evaluate variable importance
helperPlotVarImp(rf_all_nt25_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_all_nt25_TDmcwhsa, useCategory="OOB", subT="All 30 cities individually")
## # A tibble: 25 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.604
## 2 2 OOB 0.606
## 3 3 OOB 0.603
## 4 4 OOB 0.597
## 5 5 OOB 0.589
## 6 6 OOB 0.580
## 7 7 OOB 0.568
## 8 8 OOB 0.555
## 9 9 OOB 0.545
## 10 10 OOB 0.535
## # ... with 15 more rows
There are frequently misclassified cities that might benefit from being grouped and classified as such:
Encouragingly, this is broadly the same as the findings from the hierarchical clustering. The model is still learning at a decent rate, so the 25-tree forest is, as expected, too small.
A mapping file is created to map relevant cities together:
# Create the locale mapper
locMapperTibble_ss01 <- tibble::tribble(
~source, ~locType,
'katl_2016', 'Same',
'kbos_2016', 'Same',
'kdca_2016', 'DC-PHL-EWR (2016)',
'kden_2016', 'Same',
'kdfw_2016', 'Same',
'kdtw_2016', 'Detroit-Grand Rapids (2016)',
'kewr_2016', 'DC-PHL-EWR (2016)',
'kgrb_2016', 'Same',
'kgrr_2016', 'Detroit-Grand Rapids (2016)',
'kiah_2016', 'Houston-New Orleans (2016)',
'kind_2016', 'Same',
'klas_2016', 'Las Vegas-Phoenix (2016)',
'klax_2016', 'Los Angeles-San Diego (2016)',
'klnk_2016', 'Same',
'kmia_2016', 'Miami-Tampa Bay (2016)',
'kmke_2016', 'Chicago-Milwaukee (2016)',
'kmsn_2016', 'Same',
'kmsp_2016', 'Same',
'kmsy_2016', 'Houston-New Orleans (2016)',
'kord_2016', 'Chicago-Milwaukee (2016)',
'kphl_2016', 'DC-PHL-EWR (2016)',
'kphx_2016', 'Las Vegas-Phoenix (2016)',
'ksan_2016', 'Los Angeles-San Diego (2016)',
'ksat_2016', 'Same',
'ksea_2016', 'Same',
'ksfo_2016', 'Bay Area (2016)',
'ksjc_2016', 'Bay Area (2016)',
'kstl_2016', 'Same',
'ktpa_2016', 'Miami-Tampa Bay (2016)',
'ktvc_2016', 'Same'
)
# Create locMapper
locMapper_ss01 <- locMapperTibble_ss01 %>% pull(locType)
names(locMapper_ss01) <- locMapperTibble_ss01 %>% pull(source)
locMapper_ss01
## katl_2016 kbos_2016
## "Same" "Same"
## kdca_2016 kden_2016
## "DC-PHL-EWR (2016)" "Same"
## kdfw_2016 kdtw_2016
## "Same" "Detroit-Grand Rapids (2016)"
## kewr_2016 kgrb_2016
## "DC-PHL-EWR (2016)" "Same"
## kgrr_2016 kiah_2016
## "Detroit-Grand Rapids (2016)" "Houston-New Orleans (2016)"
## kind_2016 klas_2016
## "Same" "Las Vegas-Phoenix (2016)"
## klax_2016 klnk_2016
## "Los Angeles-San Diego (2016)" "Same"
## kmia_2016 kmke_2016
## "Miami-Tampa Bay (2016)" "Chicago-Milwaukee (2016)"
## kmsn_2016 kmsp_2016
## "Same" "Same"
## kmsy_2016 kord_2016
## "Houston-New Orleans (2016)" "Chicago-Milwaukee (2016)"
## kphl_2016 kphx_2016
## "DC-PHL-EWR (2016)" "Las Vegas-Phoenix (2016)"
## ksan_2016 ksat_2016
## "Los Angeles-San Diego (2016)" "Same"
## ksea_2016 ksfo_2016
## "Same" "Bay Area (2016)"
## ksjc_2016 kstl_2016
## "Bay Area (2016)" "Same"
## ktpa_2016 ktvc_2016
## "Miami-Tampa Bay (2016)" "Same"
# Create the data file with locType
locData_ss01 <- metarData %>%
mutate(locType=ifelse(locMapper_ss01[source]=="Same", locale, locMapper_ss01[source]),
hr=lubridate::hour(dtime)) %>%
filter(year==2016, locType !="Exclude")
# Counts by locType
locData_ss01 %>%
count(locType) %>%
arrange(-n) %>%
as.data.frame()
## locType n
## 1 DC-PHL-EWR (2016) 26252
## 2 Las Vegas-Phoenix (2016) 17543
## 3 Miami-Tampa Bay (2016) 17539
## 4 Los Angeles-San Diego (2016) 17536
## 5 Houston-New Orleans (2016) 17535
## 6 Detroit-Grand Rapids (2016) 17534
## 7 Chicago-Milwaukee (2016) 17527
## 8 Bay Area (2016) 17526
## 9 Atlanta, GA (2016) 8772
## 10 Dallas, TX (2016) 8772
## 11 Denver, CO (2016) 8772
## 12 Minneapolis, MN (2016) 8769
## 13 Traverse City, MI (2016) 8766
## 14 Lincoln, NE (2016) 8765
## 15 San Antonio, TX (2016) 8765
## 16 Seattle, WA (2016) 8765
## 17 Green Bay, WI (2016) 8755
## 18 Madison, WI (2016) 8750
## 19 Indianapolis, IN (2016) 8720
## 20 Saint Louis, MO (2016) 8702
## 21 Boston, MA (2016) 8663
The data are then sampled such that each locType is equally prevalent:
# Set a seed for reporducibility
set.seed(2007041419)
# Find the smallest locale type
nSmall <- locData_ss01 %>%
filter(!is.na(locType)) %>%
count(locType) %>%
pull(n) %>%
min()
# Create the relevant data subset
ss01_data <- locData_ss01 %>%
filter(!is.na(locType)) %>%
group_by(locType) %>%
sample_n(size=nSmall, replace=FALSE) %>%
ungroup()
# Sumarize the data subset
ss01_data %>%
count(locale) %>%
arrange(-n) %>%
as.data.frame()
## locale n
## 1 Atlanta, GA (2016) 8663
## 2 Boston, MA (2016) 8663
## 3 Dallas, TX (2016) 8663
## 4 Denver, CO (2016) 8663
## 5 Green Bay, WI (2016) 8663
## 6 Indianapolis, IN (2016) 8663
## 7 Lincoln, NE (2016) 8663
## 8 Madison, WI (2016) 8663
## 9 Minneapolis, MN (2016) 8663
## 10 Saint Louis, MO (2016) 8663
## 11 San Antonio, TX (2016) 8663
## 12 Seattle, WA (2016) 8663
## 13 Traverse City, MI (2016) 8663
## 14 San Francisco, CA (2016) 4400
## 15 Las Vegas, NV (2016) 4376
## 16 Milwaukee, WI (2016) 4362
## 17 Miami, FL (2016) 4360
## 18 New Orleans, LA (2016) 4357
## 19 San Diego, CA (2016) 4345
## 20 Detroit, MI (2016) 4334
## 21 Grand Rapids, MI (2016) 4329
## 22 Los Angeles, CA (2016) 4318
## 23 Houston, TX (2016) 4306
## 24 Tampa Bay, FL (2016) 4303
## 25 Chicago, IL (2016) 4301
## 26 Phoenix, AZ (2016) 4287
## 27 San Jose, CA (2016) 4263
## 28 Philadelphia, PA (2016) 2909
## 29 Washington, DC (2016) 2896
## 30 Newark, NJ (2016) 2858
A random forest can then be run to predict locType (using a slightly bigger forest of 50 trees:
# Run random forest for subset data
rf_s01_nt50_TDmcwhsa <- rfMultiLocale(ss01_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locType",
pred="locType",
otherVar=c("dtime", "locale", "source"),
ntree=50,
seed=2007041423,
mtry=4
)
##
## Running for locations:
## [1] "Atlanta, GA (2016)" "Bay Area (2016)"
## [3] "Boston, MA (2016)" "Chicago-Milwaukee (2016)"
## [5] "Dallas, TX (2016)" "DC-PHL-EWR (2016)"
## [7] "Denver, CO (2016)" "Detroit-Grand Rapids (2016)"
## [9] "Green Bay, WI (2016)" "Houston-New Orleans (2016)"
## [11] "Indianapolis, IN (2016)" "Las Vegas-Phoenix (2016)"
## [13] "Lincoln, NE (2016)" "Los Angeles-San Diego (2016)"
## [15] "Madison, WI (2016)" "Miami-Tampa Bay (2016)"
## [17] "Minneapolis, MN (2016)" "Saint Louis, MO (2016)"
## [19] "San Antonio, TX (2016)" "Seattle, WA (2016)"
## [21] "Traverse City, MI (2016)"
Accuracy can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_s01_nt50_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locType",
locOrder=NULL
)
## # A tibble: 432 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 1840 0.718
## 2 Atlanta, GA (2016) Bay Area (2016) FALSE 60 0.0234
## 3 Atlanta, GA (2016) Boston, MA (2016) FALSE 21 0.00819
## 4 Atlanta, GA (2016) Chicago-Milwaukee (2016) FALSE 17 0.00663
## 5 Atlanta, GA (2016) Dallas, TX (2016) FALSE 83 0.0324
## 6 Atlanta, GA (2016) DC-PHL-EWR (2016) FALSE 60 0.0234
## 7 Atlanta, GA (2016) Denver, CO (2016) FALSE 8 0.00312
## 8 Atlanta, GA (2016) Detroit-Grand Rapids (2016) FALSE 28 0.0109
## 9 Atlanta, GA (2016) Green Bay, WI (2016) FALSE 8 0.00312
## 10 Atlanta, GA (2016) Houston-New Orleans (2016) FALSE 52 0.0203
## # ... with 422 more rows
# Evaluate variable importance
helperPlotVarImp(rf_s01_nt50_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_s01_nt50_TDmcwhsa, useCategory="OOB", subT="Subset 01: 21 Groups of 1-3 Locales")
## # A tibble: 50 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.581
## 2 2 OOB 0.583
## 3 3 OOB 0.579
## 4 4 OOB 0.569
## 5 5 OOB 0.561
## 6 6 OOB 0.552
## 7 7 OOB 0.541
## 8 8 OOB 0.529
## 9 9 OOB 0.517
## 10 10 OOB 0.506
## # ... with 40 more rows
A few other consolidations appear merited:
Example code includes:
# Create the locale mapper
locMapperTibble_ss02 <- tibble::tribble(
~source, ~locType,
'katl_2016', 'Same',
'kbos_2016', 'DC-PHL-EWR-BOS (2016)',
'kdca_2016', 'DC-PHL-EWR-BOS (2016)',
'kden_2016', 'Same',
'kdfw_2016', 'Dallas-San Antonio (2016)',
'kdtw_2016', 'Cold Midwest (2016)',
'kewr_2016', 'DC-PHL-EWR-BOS (2016)',
'kgrb_2016', 'Cold Midwest (2016)',
'kgrr_2016', 'Cold Midwest (2016)',
'kiah_2016', 'Gulf Coast (2016)',
'kind_2016', 'St Louis-Indianapolis (2016)',
'klas_2016', 'Las Vegas-Phoenix (2016)',
'klax_2016', 'Los Angeles-San Diego (2016)',
'klnk_2016', 'Same',
'kmia_2016', 'Gulf Coast (2016)',
'kmke_2016', 'Cold Midwest (2016)',
'kmsn_2016', 'Cold Midwest (2016)',
'kmsp_2016', 'Same',
'kmsy_2016', 'Gulf Coast (2016)',
'kord_2016', 'Cold Midwest (2016)',
'kphl_2016', 'DC-PHL-EWR-BOS (2016)',
'kphx_2016', 'Las Vegas-Phoenix (2016)',
'ksan_2016', 'Los Angeles-San Diego (2016)',
'ksat_2016', 'Dallas-San Antonio (2016)',
'ksea_2016', 'Same',
'ksfo_2016', 'Bay Area (2016)',
'ksjc_2016', 'Bay Area (2016)',
'kstl_2016', 'St Louis-Indianapolis (2016)',
'ktpa_2016', 'Gulf Coast (2016)',
'ktvc_2016', 'Same'
)
# Create locMapper
locMapper_ss02 <- locMapperTibble_ss02 %>% pull(locType)
names(locMapper_ss02) <- locMapperTibble_ss02 %>% pull(source)
locMapper_ss02
## katl_2016 kbos_2016
## "Same" "DC-PHL-EWR-BOS (2016)"
## kdca_2016 kden_2016
## "DC-PHL-EWR-BOS (2016)" "Same"
## kdfw_2016 kdtw_2016
## "Dallas-San Antonio (2016)" "Cold Midwest (2016)"
## kewr_2016 kgrb_2016
## "DC-PHL-EWR-BOS (2016)" "Cold Midwest (2016)"
## kgrr_2016 kiah_2016
## "Cold Midwest (2016)" "Gulf Coast (2016)"
## kind_2016 klas_2016
## "St Louis-Indianapolis (2016)" "Las Vegas-Phoenix (2016)"
## klax_2016 klnk_2016
## "Los Angeles-San Diego (2016)" "Same"
## kmia_2016 kmke_2016
## "Gulf Coast (2016)" "Cold Midwest (2016)"
## kmsn_2016 kmsp_2016
## "Cold Midwest (2016)" "Same"
## kmsy_2016 kord_2016
## "Gulf Coast (2016)" "Cold Midwest (2016)"
## kphl_2016 kphx_2016
## "DC-PHL-EWR-BOS (2016)" "Las Vegas-Phoenix (2016)"
## ksan_2016 ksat_2016
## "Los Angeles-San Diego (2016)" "Dallas-San Antonio (2016)"
## ksea_2016 ksfo_2016
## "Same" "Bay Area (2016)"
## ksjc_2016 kstl_2016
## "Bay Area (2016)" "St Louis-Indianapolis (2016)"
## ktpa_2016 ktvc_2016
## "Gulf Coast (2016)" "Same"
# Create the data file with locType
locData_ss02 <- metarData %>%
mutate(locType=ifelse(locMapper_ss02[source]=="Same", locale, locMapper_ss02[source]),
hr=lubridate::hour(dtime)) %>%
filter(year==2016, locType !="Exclude")
# Counts by locType
locData_ss02 %>%
count(locType) %>%
arrange(-n) %>%
as.data.frame()
## locType n
## 1 Cold Midwest (2016) 52566
## 2 Gulf Coast (2016) 35074
## 3 DC-PHL-EWR-BOS (2016) 34915
## 4 Las Vegas-Phoenix (2016) 17543
## 5 Dallas-San Antonio (2016) 17537
## 6 Los Angeles-San Diego (2016) 17536
## 7 Bay Area (2016) 17526
## 8 St Louis-Indianapolis (2016) 17422
## 9 Atlanta, GA (2016) 8772
## 10 Denver, CO (2016) 8772
## 11 Minneapolis, MN (2016) 8769
## 12 Traverse City, MI (2016) 8766
## 13 Lincoln, NE (2016) 8765
## 14 Seattle, WA (2016) 8765
The data are then sampled such that each locType is equally prevalent:
# Set a seed for reporducibility
set.seed(2007041447)
# Find the smallest locale type
nSmall <- locData_ss02 %>%
filter(!is.na(locType)) %>%
count(locType) %>%
pull(n) %>%
min()
# Create the relevant data subset
ss02_data <- locData_ss02 %>%
filter(!is.na(locType)) %>%
group_by(locType) %>%
sample_n(size=nSmall, replace=FALSE) %>%
ungroup()
# Sumarize the data subset
ss02_data %>%
count(locale) %>%
arrange(-n) %>%
as.data.frame()
## locale n
## 1 Atlanta, GA (2016) 8765
## 2 Denver, CO (2016) 8765
## 3 Lincoln, NE (2016) 8765
## 4 Minneapolis, MN (2016) 8765
## 5 Seattle, WA (2016) 8765
## 6 Traverse City, MI (2016) 8765
## 7 Indianapolis, IN (2016) 4477
## 8 San Diego, CA (2016) 4411
## 9 San Francisco, CA (2016) 4399
## 10 Dallas, TX (2016) 4393
## 11 Las Vegas, NV (2016) 4391
## 12 Phoenix, AZ (2016) 4374
## 13 San Antonio, TX (2016) 4372
## 14 San Jose, CA (2016) 4366
## 15 Los Angeles, CA (2016) 4354
## 16 Saint Louis, MO (2016) 4288
## 17 Washington, DC (2016) 2204
## 18 New Orleans, LA (2016) 2197
## 19 Houston, TX (2016) 2196
## 20 Philadelphia, PA (2016) 2196
## 21 Tampa Bay, FL (2016) 2187
## 22 Boston, MA (2016) 2186
## 23 Miami, FL (2016) 2185
## 24 Newark, NJ (2016) 2179
## 25 Chicago, IL (2016) 1525
## 26 Madison, WI (2016) 1468
## 27 Green Bay, WI (2016) 1453
## 28 Grand Rapids, MI (2016) 1445
## 29 Milwaukee, WI (2016) 1442
## 30 Detroit, MI (2016) 1432
A random forest can then be run to predict locType (using a forest of 50 trees):
# Run random forest for subset data
rf_s02_nt50_TDmcwhsa <- rfMultiLocale(ss02_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locType",
pred="locType",
otherVar=c("dtime", "locale", "source"),
ntree=50,
seed=2007041449,
mtry=4
)
##
## Running for locations:
## [1] "Atlanta, GA (2016)" "Bay Area (2016)"
## [3] "Cold Midwest (2016)" "Dallas-San Antonio (2016)"
## [5] "DC-PHL-EWR-BOS (2016)" "Denver, CO (2016)"
## [7] "Gulf Coast (2016)" "Las Vegas-Phoenix (2016)"
## [9] "Lincoln, NE (2016)" "Los Angeles-San Diego (2016)"
## [11] "Minneapolis, MN (2016)" "Seattle, WA (2016)"
## [13] "St Louis-Indianapolis (2016)" "Traverse City, MI (2016)"
Accuracy can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_s02_nt50_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locType",
locOrder=NULL
)
## # A tibble: 193 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 2026 0.769
## 2 Atlanta, GA (2016) Bay Area (2016) FALSE 94 0.0357
## 3 Atlanta, GA (2016) Cold Midwest (2016) FALSE 13 0.00493
## 4 Atlanta, GA (2016) Dallas-San Antonio (2016) FALSE 117 0.0444
## 5 Atlanta, GA (2016) DC-PHL-EWR-BOS (2016) FALSE 48 0.0182
## 6 Atlanta, GA (2016) Denver, CO (2016) FALSE 12 0.00455
## 7 Atlanta, GA (2016) Gulf Coast (2016) FALSE 86 0.0326
## 8 Atlanta, GA (2016) Las Vegas-Phoenix (2016) FALSE 33 0.0125
## 9 Atlanta, GA (2016) Lincoln, NE (2016) FALSE 16 0.00607
## 10 Atlanta, GA (2016) Los Angeles-San Diego (2016) FALSE 54 0.0205
## # ... with 183 more rows
# Evaluate variable importance
helperPlotVarImp(rf_s02_nt50_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_s02_nt50_TDmcwhsa, useCategory="OOB", subT="Subset 02: 14 Groups of 1-6 Locales")
## # A tibble: 50 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.519
## 2 2 OOB 0.519
## 3 3 OOB 0.513
## 4 4 OOB 0.505
## 5 5 OOB 0.492
## 6 6 OOB 0.479
## 7 7 OOB 0.468
## 8 8 OOB 0.457
## 9 9 OOB 0.447
## 10 10 OOB 0.435
## # ... with 40 more rows
Overall accuracy continues to improve. Further exploration may allow for continued evolution of the archetypes, particularly as related to the colder cities where accuracy is not improving with consolidation to archetypes.
Data are limited to only locales that do not meaningfully overlap with ‘Cold Midwest’. Functions are written to create the mapper, associated dataset, and associated data sample:
# Create the locale mapper
locMapperTibble_ss03 <- tibble::tribble(
~source, ~locType,
'katl_2016', 'Same',
'kbos_2016', 'Exclude',
'kdca_2016', 'Exclude',
'kden_2016', 'Same',
'kdfw_2016', 'Dallas-San Antonio (2016)',
'kdtw_2016', 'Exclude',
'kewr_2016', 'Exclude',
'kgrb_2016', 'Exclude',
'kgrr_2016', 'Exclude',
'kiah_2016', 'Gulf Coast (2016)',
'kind_2016', 'Exclude',
'klas_2016', 'Las Vegas-Phoenix (2016)',
'klax_2016', 'Los Angeles-San Diego (2016)',
'klnk_2016', 'Exclude',
'kmia_2016', 'Gulf Coast (2016)',
'kmke_2016', 'Exclude',
'kmsn_2016', 'Exclude',
'kmsp_2016', 'Exclude',
'kmsy_2016', 'Gulf Coast (2016)',
'kord_2016', 'Exclude',
'kphl_2016', 'Exclude',
'kphx_2016', 'Las Vegas-Phoenix (2016)',
'ksan_2016', 'Los Angeles-San Diego (2016)',
'ksat_2016', 'Dallas-San Antonio (2016)',
'ksea_2016', 'Same',
'ksfo_2016', 'Bay Area (2016)',
'ksjc_2016', 'Bay Area (2016)',
'kstl_2016', 'Exclude',
'ktpa_2016', 'Gulf Coast (2016)',
'ktvc_2016', 'Exclude'
)
# Create the locale mapper
locMapper_ss03 <- createLocMapper(locMapperTibble_ss03)
# Create the data file with locType
locData_ss03 <- applyLocMapper(metarData, mapper=locMapper_ss03, yearsUse=c(2016))
## locType n
## 1 Gulf Coast (2016) 35074
## 2 Las Vegas-Phoenix (2016) 17543
## 3 Dallas-San Antonio (2016) 17537
## 4 Los Angeles-San Diego (2016) 17536
## 5 Bay Area (2016) 17526
## 6 Atlanta, GA (2016) 8772
## 7 Denver, CO (2016) 8772
## 8 Seattle, WA (2016) 8765
# Create the smaller data file that is balanced by locType
ss03_data <- createBalancedSample(locData_ss03, seed=2007051311)
## locale n
## 1 Atlanta, GA (2016) 8765
## 2 Denver, CO (2016) 8765
## 3 Seattle, WA (2016) 8765
## 4 Las Vegas, NV (2016) 4406
## 5 San Diego, CA (2016) 4399
## 6 San Francisco, CA (2016) 4398
## 7 San Antonio, TX (2016) 4384
## 8 Dallas, TX (2016) 4381
## 9 San Jose, CA (2016) 4367
## 10 Los Angeles, CA (2016) 4366
## 11 Phoenix, AZ (2016) 4359
## 12 New Orleans, LA (2016) 2219
## 13 Tampa Bay, FL (2016) 2188
## 14 Houston, TX (2016) 2179
## 15 Miami, FL (2016) 2179
A random forest can then be run on both the individual locales (locData_ss03) and the balanced sample of locale types (ss03_data):
# Run random forest for subset data
rf_s03_nt100_TDmcwhsa <- rfMultiLocale(ss03_data,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locType",
pred="locType",
otherVar=c("dtime", "locale", "source"),
ntree=100,
seed=2007051315,
mtry=4
)
##
## Running for locations:
## [1] "Atlanta, GA (2016)" "Bay Area (2016)"
## [3] "Dallas-San Antonio (2016)" "Denver, CO (2016)"
## [5] "Gulf Coast (2016)" "Las Vegas-Phoenix (2016)"
## [7] "Los Angeles-San Diego (2016)" "Seattle, WA (2016)"
Accuracy can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_s03_nt100_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locType",
locOrder=NULL
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 63 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 2156 0.834
## 2 Atlanta, GA (2016) Bay Area (2016) FALSE 72 0.0279
## 3 Atlanta, GA (2016) Dallas-San Antonio (2016) FALSE 144 0.0557
## 4 Atlanta, GA (2016) Denver, CO (2016) FALSE 22 0.00851
## 5 Atlanta, GA (2016) Gulf Coast (2016) FALSE 79 0.0306
## 6 Atlanta, GA (2016) Las Vegas-Phoenix (2016) FALSE 36 0.0139
## 7 Atlanta, GA (2016) Los Angeles-San Diego (2016) FALSE 39 0.0151
## 8 Atlanta, GA (2016) Seattle, WA (2016) FALSE 37 0.0143
## 9 Bay Area (2016) Atlanta, GA (2016) FALSE 33 0.0124
## 10 Bay Area (2016) Bay Area (2016) TRUE 2228 0.837
## # ... with 53 more rows
# Evaluate variable importance
helperPlotVarImp(rf_s03_nt100_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_s03_nt100_TDmcwhsa,
useCategory="OOB",
subT="Exclude Locales with High Overlap to Cold Midwest"
)
## # A tibble: 100 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.327
## 2 2 OOB 0.331
## 3 3 OOB 0.322
## 4 4 OOB 0.312
## 5 5 OOB 0.301
## 6 6 OOB 0.290
## 7 7 OOB 0.281
## 8 8 OOB 0.268
## 9 9 OOB 0.256
## 10 10 OOB 0.250
## # ... with 90 more rows
Predictions continue to be highly accurate for Denver, Seattle, and Las Vegas-Phoenix. There is some mixing of the Pacific coastal cities - Bay Area sometimes as Seattle or LA/San Diego, and LA/San Diego sometimes as Bay Area. There is also some meaningful overlap among Dallas-San Antonio, Gulf Coast, and Atlanta.
The process can then be repeated for the individual cities:
# Run random forest for subset data
rf_l03_nt100_TDmcwhsa <- rfMultiLocale(locData_ss03,
vrbls=c("TempF", "DewF",
"month", "hr",
"minHeight", "ceilingHeight",
"WindSpeed", "predomDir",
"modSLP", "Altimeter"
),
locs=NULL,
locVar="locale",
pred="locale",
otherVar=c("dtime", "locale", "source"),
ntree=100,
seed=2007051325,
mtry=4
)
##
## Running for locations:
## [1] "Atlanta, GA (2016)" "Dallas, TX (2016)"
## [3] "Denver, CO (2016)" "Houston, TX (2016)"
## [5] "Las Vegas, NV (2016)" "Los Angeles, CA (2016)"
## [7] "Miami, FL (2016)" "New Orleans, LA (2016)"
## [9] "Phoenix, AZ (2016)" "San Antonio, TX (2016)"
## [11] "San Diego, CA (2016)" "San Francisco, CA (2016)"
## [13] "San Jose, CA (2016)" "Seattle, WA (2016)"
## [15] "Tampa Bay, FL (2016)"
Accuracy can then be assessed:
# Evaluate prediction accuracy
evalPredictions(rf_l03_nt100_TDmcwhsa,
plotCaption = "Temp, Dew Point, Month, Hour of Day, Cloud Height, Wind, Altimeter, and SLP",
keyVar="locale",
locOrder=NULL
)
## Warning: Removed 1 rows containing missing values (geom_text).
## # A tibble: 216 x 5
## locale predicted correct n pct
## <fct> <fct> <lgl> <int> <dbl>
## 1 Atlanta, GA (2016) Atlanta, GA (2016) TRUE 2064 0.775
## 2 Atlanta, GA (2016) Dallas, TX (2016) FALSE 125 0.0469
## 3 Atlanta, GA (2016) Denver, CO (2016) FALSE 23 0.00864
## 4 Atlanta, GA (2016) Houston, TX (2016) FALSE 41 0.0154
## 5 Atlanta, GA (2016) Las Vegas, NV (2016) FALSE 27 0.0101
## 6 Atlanta, GA (2016) Los Angeles, CA (2016) FALSE 49 0.0184
## 7 Atlanta, GA (2016) Miami, FL (2016) FALSE 18 0.00676
## 8 Atlanta, GA (2016) New Orleans, LA (2016) FALSE 34 0.0128
## 9 Atlanta, GA (2016) Phoenix, AZ (2016) FALSE 24 0.00901
## 10 Atlanta, GA (2016) San Antonio, TX (2016) FALSE 54 0.0203
## # ... with 206 more rows
# Evaluate variable importance
helperPlotVarImp(rf_l03_nt100_TDmcwhsa$rfModel)
# Evaluate error evolution
errorEvolution(rf_l03_nt100_TDmcwhsa,
useCategory="OOB",
subT="Exclude Locales with High Overlap to Cold Midwest"
)
## # A tibble: 100 x 3
## ntree Category Error
## <int> <chr> <dbl>
## 1 1 OOB 0.442
## 2 2 OOB 0.440
## 3 3 OOB 0.431
## 4 4 OOB 0.428
## 5 5 OOB 0.418
## 6 6 OOB 0.406
## 7 7 OOB 0.394
## 8 8 OOB 0.380
## 9 9 OOB 0.368
## 10 10 OOB 0.358
## # ... with 90 more rows
At a glance, the combinations chosen from the individual cities seem sensible given how the individual cities tend to (mis)classify.
Random forests can also be used to run regressions, such as on variables like temperature or dew point. Suppose that a few data elements are available, and the goal is to predict the temmperature based on those:
# Build a regression dataset with factored variables for locale and source
regrData <- metarData %>%
mutate(hr=lubridate::hour(lubridate::round_date(dtime, unit="1 hour")),
hrfct=factor(hr),
localefct=factor(locale),
locName=str_replace(locale, pattern=" \\(\\d{4}\\)", replacement=""),
locNamefct=factor(locName)
)
# Define a dependent variable and predictor variables and filtering criteria
depVar <- "TempF"
predVars <- c("DewF", "month", "hrfct", "locNamefct")
critFilter <- list("year"=c(2016),
"locNamefct"=c("Chicago, IL", "Las Vegas, NV", "San Diego, CA", "New Orleans, LA")
)
# Filter such that only non-NA data across all of depVar and predVars are included
subRegData <- regrData %>%
filter_at(vars(all_of(c(depVar, predVars, names(critFilter)))), all_vars(!is.na(.)))
# Filter such that only matches to critFilter are included
for (xNum in seq_len(length(critFilter))) {
subRegData <- subRegData %>%
filter_at(vars(all_of(names(critFilter)[xNum])), ~. %in% critFilter[[xNum]])
}
# Split in to a test set and a train set
testSize <- 0.3
set.seed(2007151232)
idxTrain <- sample(1:nrow(subRegData), size=round((1-testSize)*nrow(subRegData)), replace=FALSE)
trainRegData <- subRegData[idxTrain, ] # will be in random order
testRegData <- subRegData[-idxTrain, ] # will be in sorted order
# Run random forest regression
rfRegrData <- randomForest::randomForest(x=trainRegData[, predVars, drop=FALSE],
y=trainRegData[, depVar, drop=TRUE],
ntree=25,
mtry=2
)
# Assess variable importance
rfRegrData %>%
caret::varImp()
## Overall
## DewF 1940150.5
## month 2886275.7
## hrfct 543114.8
## locNamefct 1848806.0
# Assess evolution of OOB r-squared and RMSE
tibble::tibble(ntree=1:25, rmse=rfRegrData$mse**0.5) %>%
ggplot(aes(x=ntree, y=rmse)) +
geom_point() +
ylim(c(0, NA)) +
labs(x="# Trees", y="RMSE", title="Evolution of RMSE")
tibble::tibble(ntree=1:25, rmse=rfRegrData$rsq) %>%
ggplot(aes(x=ntree, y=rmse)) +
geom_point() +
ylim(c(NA, 1)) +
labs(x="# Trees", y="R-squared", title="Evolution of R-squared")
# Apply to test-set data and report metrics
rfPreds <- testRegData %>%
mutate(pred=predict(rfRegrData, newdata=testRegData[, predVars, drop=FALSE]),
globMean=mean(get(depVar))
) %>%
group_by(locale) %>%
mutate(locMean=mean(get(depVar))) %>%
ungroup() %>%
mutate(errActualGlob=(get(depVar)-globMean)**2,
errActualLoc=(get(depVar)-locMean)**2,
errPredict=(get(depVar)-pred)**2
)
# Calculate the errors by locale
rfError <- rfPreds %>%
select(locale, starts_with("err")) %>%
group_by(locale) %>%
summarize_all(~mean(.)**0.5)
# Plot errors by locale
rfError %>%
pivot_longer(-locale, names_to="errorType", values_to="rmse") %>%
ggplot(aes(x=locale, y=rmse, fill=errorType)) +
geom_col(position="dodge") +
labs(x="", y="Average Error", title="Prediction Accuracy by Locale") +
scale_fill_discrete("Actual vs.",
c("errActualGlob", "errActualLoc", "errPredict"),
c("Global Mean", "Locale Mean", "Predicted")
)
# Print errors by locale, including percentage changes
rfError %>%
mutate(pctLocGlob=errActualLoc/errActualGlob,
pctPredLoc=errPredict/errActualLoc,
pctPredGlob=errPredict/errActualGlob
)
## # A tibble: 4 x 7
## locale errActualGlob errActualLoc errPredict pctLocGlob pctPredLoc pctPredGlob
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Chica~ 24.5 20.7 5.47 0.844 0.264 0.223
## 2 Las V~ 19.5 18.5 5.74 0.951 0.310 0.295
## 3 New O~ 14.9 13.1 3.69 0.879 0.282 0.248
## 4 San D~ 6.82 6.82 3.41 1.00 0.499 0.499
A simple random forest significantly reduces null error in predicting temperatures, particularly for Chicago, Las Vegas, and New Orleans. This suggests knowing the dew point and the month enhance predictions for seasonal locales, while being of less use in a marine climate with modest seasonal variation.
Note that San Diego has more accurate predictions than any other locale, it is just that San Diego does not need much more than locale to be accurate. In contrast, Chicago averages ~25 degree error if using global mean, ~20 degrees error if using local mean, and ~5 degrees error if using dew point, month, and hour.
Notably, it appears that running with just a single tree explains nearly 90% of the variance and drives average prediction errors down to under 6 degrees. While there is evolution with additional trees, it appears modest. Potentially, large forests may not be less necessary for this regression analysis phase.
The above process is converted to functional form, with the main function running the filtering, then calling a modified rfTwoLocale() that is adapted to handle classification or regression. The function reRegression() is available in WeatherModelingFunctions_v001.R.
The regression is run again using the same arguments with the new function, and results are cached:
rfRegrData2 <- rfRegression(regrData,
depVar=depVar,
predVars=predVars,
critFilter=critFilter,
seed=2007151232,
ntree=25,
mtry=2,
testSize=0.3
)
The general outcomes are nearly identical:
# Assess variable importance
rfRegrData2$rfModel %>%
caret::varImp()
## Overall
## DewF 1878740.7
## month 2945222.5
## hrfct 545891.1
## locNamefct 1886122.8
# Assess evolution of OOB r-squared and RMSE
tibble::tibble(ntree=1:25, rmse=rfRegrData2$mse**0.5) %>%
ggplot(aes(x=ntree, y=rmse)) +
geom_point() +
ylim(c(0, NA)) +
labs(x="# Trees", y="RMSE", title="Evolution of RMSE")
tibble::tibble(ntree=1:25, rmse=rfRegrData2$rsq) %>%
ggplot(aes(x=ntree, y=rmse)) +
geom_point() +
ylim(c(NA, 1)) +
labs(x="# Trees", y="R-squared", title="Evolution of R-squared")
An evaluation function suitable for regresion is written, to cover the following metrics:
Each sub-section is written as its own function, with a wrapper function that calls then all together. The below functions are available in WeatherModelingFunctions_v001.R:
Example code for running the functions alone and together includes:
# Run function #1
plotVariableImportance(rfRegrData,
returnData=TRUE,
caption="temperature ~ f(locale, month, hour, dew point)",
titleAdd=": Predicting Temperature"
)
## # A tibble: 4 x 2
## predictor Overall
## <chr> <dbl>
## 1 DewF 1940151.
## 2 month 2886276.
## 3 hrfct 543115.
## 4 locNamefct 1848806.
# Run function #2
plotMSERSQ(rfRegrData, returnData=TRUE)
## # A tibble: 25 x 3
## ntree rmse rsq
## <int> <dbl> <dbl>
## 1 1 5.78 0.891
## 2 2 5.64 0.897
## 3 3 5.51 0.901
## 4 4 5.39 0.906
## 5 5 5.29 0.909
## 6 6 5.23 0.911
## 7 7 5.19 0.913
## 8 8 5.14 0.914
## 9 9 5.07 0.916
## 10 10 5.04 0.917
## # ... with 15 more rows
# Run function #3
plotErrorByVar(rfRegrData2, depVar="TempF", keyVar="locNamefct", returnData=TRUE)
## # A tibble: 4 x 5
## locNamefct varTot varMod rmse rsq
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Chicago, IL 427. 29.9 5.46 0.930
## 2 Las Vegas, NV 343. 32.9 5.73 0.904
## 3 New Orleans, LA 171. 13.7 3.70 0.920
## 4 San Diego, CA 46.6 11.8 3.43 0.747
# Run function #4
plotActualVsPredicted(rfRegrData2, depVar="TempF", facetVar="locNamefct", returnData=TRUE)
## # A tibble: 10,495 x 9
## actual source dtime DewF month hrfct locNamefct predicted
## <dbl> <chr> <dttm> <dbl> <fct> <fct> <fct> <dbl>
## 1 41 klas_~ 2016-01-01 03:56:00 8.06 Jan 4 Las Vegas~ 42.9
## 2 39.0 klas_~ 2016-01-01 04:56:00 8.06 Jan 5 Las Vegas~ 37.5
## 3 32 klas_~ 2016-01-01 10:56:00 8.96 Jan 11 Las Vegas~ 34.8
## 4 30.0 klas_~ 2016-01-01 11:56:00 8.96 Jan 12 Las Vegas~ 35.1
## 5 42.1 klas_~ 2016-01-01 18:56:00 8.96 Jan 19 Las Vegas~ 45.4
## 6 46.0 klas_~ 2016-01-01 20:56:00 8.96 Jan 21 Las Vegas~ 49.6
## 7 48.0 klas_~ 2016-01-01 22:56:00 10.9 Jan 23 Las Vegas~ 53.8
## 8 45.0 klas_~ 2016-01-02 01:56:00 10.9 Jan 2 Las Vegas~ 49.4
## 9 41 klas_~ 2016-01-02 02:56:00 12.9 Jan 3 Las Vegas~ 51.0
## 10 39.9 klas_~ 2016-01-02 08:56:00 15.1 Jan 9 Las Vegas~ 43.5
## # ... with 10,485 more rows, and 1 more variable: correct <lgl>
# Run combined function
rfOut_002 <- evalRFRegression(rfRegrData2,
depVar="TempF",
keyVar="locNamefct",
facetVar="locNamefct",
caption="Temp ~ f(locale, month, dew point, hour)",
returnData=TRUE
)
The random forest is then run for an expanded list of locales in 2016, with results cached:
keyLocs3 <- c("Chicago, IL", "Las Vegas, NV", "San Diego, CA", "New Orleans, LA", "Atlanta, GA", "Denver, CO", "Dallas, TX", "Miami, FL", "Phoenix, AZ", "Seattle, WA", "Boston, MA", "San Francisco, CA")
# Run for select 2016 data
rfRegrData3 <- rfRegression(regrData,
depVar="TempF",
predVars=c("DewF", "month", "hrfct", "locNamefct"),
critFilter=list(year=2016, locNamefct=keyLocs3),
seed=2007171405,
ntree=20,
mtry=2,
testSize=0.3
)
The results can then be evaluated:
rfOut_003 <- evalRFRegression(rfRegrData3,
depVar="TempF",
keyVar="locNamefct",
facetVar="locNamefct",
caption="Temp ~ f(locale, month, dew point, hour)",
returnData=TRUE
)
Month, locale, and dewpoint are all relatively high in variable importance, while hour is low. This is somewhat surprising given that there are usually diurnal temperature changes, though the results suggest that month and dew point have greater impact on temperatures. Denver is the hardest locale for the model to predict.
The process is run again with hour excluded to see how that impacts the results:
keyLocs3a <- c("Chicago, IL", "Las Vegas, NV", "San Diego, CA", "New Orleans, LA", "Atlanta, GA", "Denver, CO", "Dallas, TX", "Miami, FL", "Phoenix, AZ", "Seattle, WA", "Boston, MA", "San Francisco, CA")
# Run for select 2016 data
rfRegrData3a <- rfRegression(regrData,
depVar="TempF",
predVars=c("DewF", "month", "locNamefct"),
otherVar=c("source", "dtime", "hrfct"),
critFilter=list(year=2016, locNamefct=keyLocs3a),
seed=2007171405,
ntree=20,
mtry=2,
testSize=0.3
)
And, the results are again evaluated:
rfOut_003a <- evalRFRegression(rfRegrData3a,
depVar="TempF",
keyVar="locNamefct",
facetVar="locNamefct",
caption="Temp ~ f(locale, month, dew point)",
returnData=TRUE
)
Run time is significantly faster with the elimination of the 24-category factor, hour. Interestingly, there is no longer mush change in RMSE or R-squared with more trees, suggestive that the main work being done in the previous model is to incorporate the impact of hour. Removing hour meaningfullky degrades both RMSE (~1.5 degrees more delta) and R-squared (~5% less variance explained). Several locales no longer have prediction smooths that follow the 45-degree line, suggesting that hour may be especially helpful in specific locales. Next steps are to explore this further.
Single-variable regressions are run to assess performance, using 1) only dew point, 2) only month, 3) only locale, 4) only hour, 5) only sea-level pressure, and 6) only altimeter:
# Set up the eplanatory variables and other variables to keep from 'testData
varSingles <- c("DewF", "month", "locNamefct", "hrfct", "modSLP", "Altimeter")
otherVars <- c("source", "dtime", "DewF", "month", "locNamefct", "hrfct", "modSLP", "Altimeter")
# Set up listContainerRFReg to hold results
listContainerRFReg <- vector("list", 6)
names(listContainerRFReg) <- varSingles
n <- 1
for (thisVar in varSingles) {
# Run a single-variable regression for TempF vs. thisVar (should not evolve with number of trees)
listContainerRFReg[[n]] <- rfRegression(regrData,
depVar="TempF",
predVars=thisVar,
otherVar=otherVars,
critFilter=list(year=2016, locNamefct=keyLocs3a),
seed=2007181302+n,
ntree=5,
mtry=1,
testSize=0.3
)
# Increment n
n <- n + 1
}
# R-squared of overall model by model type
rsq <- sapply(listContainerRFReg, FUN=function(x) x$rsq) %>%
tibble::as_tibble() %>%
mutate(ntree=row_number())
# Plot of final R-squared
rsq %>%
filter(ntree==max(ntree)) %>%
pivot_longer(-ntree, names_to="var", values_to="rsq") %>%
ggplot(aes(x=fct_reorder(paste0(var, "\n", varMapper[var]), rsq), y=rsq)) +
geom_col(fill="lightblue") +
geom_text(aes(y=rsq/2, label=round(rsq, 3))) +
coord_flip() +
labs(x="",
y="Overall R-squared",
title="Predicting temperature using only a single variable"
)
At an overall level, month and dew point are the most helpful in differentiating temperatures, followed by locale. Hour is the least helpful, and appears to add value only when other variables have already been included.
How is accuracy in the varying locales impacted by the variable used?
# Impact for just using month
plotErrorByVar(listContainerRFReg$month,
depVar="TempF",
keyVar="locNamefct",
caption="Temperature ~ f(month)"
)
## Warning: Removed 4 rows containing missing values (position_stack).
# Impact for just using dew point
plotErrorByVar(listContainerRFReg$DewF,
depVar="TempF",
keyVar="locNamefct",
caption="Temperature ~ f(dewpoint)"
)
## Warning: Removed 4 rows containing missing values (position_stack).
# Impact for just using locale
plotErrorByVar(listContainerRFReg$locNamefct,
depVar="TempF",
keyVar="month",
caption="Temperature ~ f(locale)"
)
## Warning: Removed 7 rows containing missing values (position_stack).
The issues with using a single predictor are obvious, as many of R-squared by dimension become negative. Specifically:
The month-only result is explored in more detail:
# Predictions based solely on month
listContainerRFReg$month$testData %>%
group_by(locNamefct, month) %>%
summarize(meanTemp=mean(TempF)) %>%
group_by(month) %>%
mutate(meanMonth=mean(meanTemp)) %>%
ungroup() %>%
filter(locNamefct %in% c("Atlanta, GA", "Dallas, TX", "Las Vegas, NV",
"Miami, FL", "San Francisco, CA", "San Diego, CA"
)
) %>%
ggplot(aes(x=month, y=meanTemp)) +
geom_line(aes(group=locNamefct, color=locNamefct)) +
geom_line(data=~filter(., locNamefct=="Atlanta, GA"), aes(y=meanMonth, group=1, lwd=2)) +
labs(x="",
y="Average Temperature",
title="Average Temperature by Month",
subtitle="Overall and by locale"
) +
scale_color_discrete("Locale") +
scale_size_continuous("Overall Average", labels=NULL)
The findings are as expected. Atlanta, Dallas, and Las Vegas have seasonal variations that broadly follow the overall average. In contrast, San Francisco and San Diego tend to be warmer than overall average in winter and colder than overall average in summer. Miami tends to be warmer than overall average in all months, and with not much montly variation.
Clearly, at least two variables are needed. Suppose that for simplification, only the two best factor variable predictors are used:
# Run a regression for TempF vs. month and locale
rfTempvMonthLocale <- rfRegression(regrData,
depVar="TempF",
predVars=c("month", "locNamefct"),
otherVar=otherVars,
critFilter=list(year=2016, locNamefct=keyLocs3a),
seed=2007181357,
ntree=5,
mtry=2,
testSize=0.3
)
# Evaluate simple regression using only locale and month
rfOut_tvml <- evalRFRegression(rfTempvMonthLocale,
depVar="TempF",
keyVar="locNamefct",
facetVar="locNamefct",
caption="Temp ~ f(locale, month)",
returnData=TRUE
)
# Impact on predictions by month
plotErrorByVar(rfTempvMonthLocale,
depVar="TempF",
keyVar="month",
caption="Temp ~ f(locale, month)"
)
# Impact on predictions by hour
plotErrorByVar(rfTempvMonthLocale,
depVar="TempF",
keyVar="hrfct",
caption="Temp ~ f(locale, month)"
)
As expected, R-squared is now positive for all elements on the locale and month dimensions. Predictions appear to be less accurate for the 19-24Z range, reflecting roughly the afternoon in the US, and to be most accurate in the 2-8 (early nighttime) and 14-17 (morning) hours. There is no evolution in RMSE or R-squared with number of trees since the model is (essentially) just finding the average temperature by locale and month.
Hour is added to the model for an additional attempt to improve accuracy:
# Run a regression for TempF vs. month and locale and hour
rfTempvMonthLocaleHour <- rfRegression(regrData,
depVar="TempF",
predVars=c("month", "locNamefct", "hrfct"),
otherVar=otherVars,
critFilter=list(year=2016, locNamefct=keyLocs3a),
seed=2007181421,
ntree=5,
mtry=3,
testSize=0.3
)
Evaluations are then performed:
# Evaluate simple regression using only locale and month and hour
rfOut_tvmlh <- evalRFRegression(rfTempvMonthLocaleHour,
depVar="TempF",
keyVar="locNamefct",
facetVar="locNamefct",
caption="Temp ~ f(locale, month, hour)",
returnData=TRUE
)
# Impact on predictions by month
plotErrorByVar(rfTempvMonthLocaleHour,
depVar="TempF",
keyVar="month",
caption="Temp ~ f(locale, month, hour)"
)
# Impact on predictions by hour
plotErrorByVar(rfTempvMonthLocaleHour,
depVar="TempF",
keyVar="hrfct",
caption="Temp ~ f(locale, month, hour)"
)
Accuracy by hour is more uniform, and the overall predictive power has increased.
Run times show the drawback of using random forest regression on such a simple problem:
# Extract the key data
lmData <- regrData %>%
filter(year==2016, locNamefct %in% keyLocs3a)
# Extract the relevant testData
testData <- rfTempvMonthLocaleHour$testData
# Create trainData as lmData anti-joined to testData
trainData <- lmData %>%
anti_join(select(testData, source, dtime))
## Joining, by = c("source", "dtime")
# Get the averages from train data
trainAvg <-
trainData %>%
group_by(locNamefct, month, hrfct) %>%
summarize(predMean=mean(TempF, na.rm=TRUE))
# Merge back to testData
testData <- testData %>%
left_join(trainAvg)
## Joining, by = c("month", "locNamefct", "hrfct")
# Report on RMSE and R-squared
testData %>%
summarize(varTot=var(TempF), varPred=var(predMean-TempF, na.rm=TRUE),
rsq=1-varPred/varTot, rmse=varPred**0.5
)
## # A tibble: 1 x 4
## varTot varPred rsq rmse
## <dbl> <dbl> <dbl> <dbl>
## 1 310. 47.8 0.846 6.92
The identical RMSE and R-squared are obtained in under a second. Note that running lm(TempF ~ locNamefct * month * hrfct, …) takes much longer than the random forest regression due to complexities related to using solve() and storing diagnostic statistics. Model selection should be driven by many factors. In this specific example, they all converge to predicting temperature as the mean for each grouping of factors.
A run of the random forest regression is then made using only a few numeric variables - dew point, sea-level pressure, and altimeter:
# Run a regression for TempF vs. dewpoint and sea-lvele pressure and altimeter
rfTempvDewSLPAltimeter <- rfRegression(regrData,
depVar="TempF",
predVars=c("DewF", "modSLP", "Altimeter"),
otherVar=otherVars,
critFilter=list(year=2016, locNamefct=keyLocs3a),
seed=2007191311,
ntree=5,
mtry=3,
testSize=0.3
)
# Evaluate simple regression using only dewpoint and sea-lvele pressure and altimeter
rfOut_tvdsa <- evalRFRegression(rfTempvDewSLPAltimeter,
depVar="TempF",
keyVar="locNamefct",
facetVar="locNamefct",
caption="Temp ~ f(dew point, SLP, altimeter)",
returnData=TRUE
)
## Warning: Removed 2 rows containing missing values (position_stack).
# Impact on predictions by month
plotErrorByVar(rfTempvDewSLPAltimeter,
depVar="TempF",
keyVar="month",
caption="Temp ~ f(dew point, SLP, altimeter)"
)
# Impact on predictions by hour
plotErrorByVar(rfTempvDewSLPAltimeter,
depVar="TempF",
keyVar="hrfct",
caption="Temp ~ f(dew point, SLP, altimeter)"
)
Dew point and sea-level pressure stand out as more important then altimeter. Notably, unlike the categorical regression, the model continues to learn with more trees suggesting that 5 is too few. Additionally, predictions are “worse than null” for San Diego and San Francsico, suggesting that the locale variable is vital for establishing differing relationships of temperature to dew point in different locales.
An additional version of the model is run taking the two highest importance variables from each of the previous regressions - locale, month, dew point, and sea-level pressure:
# Run a regression for TempF vs. month, locale, dew point, sea-level pressure
rfTempvDewSLPMonthLocale <- rfRegression(regrData,
depVar="TempF",
predVars=c("DewF", "modSLP", "month", "locNamefct"),
otherVar=otherVars,
critFilter=list(year=2016, locNamefct=keyLocs3a),
seed=2007191321,
ntree=20,
mtry=2,
testSize=0.3
)
# Evaluate regression using dew point, sea-level pressure, month, locale
rfOut_tvdsml <- evalRFRegression(rfTempvDewSLPMonthLocale,
depVar="TempF",
keyVar="locNamefct",
facetVar="locNamefct",
caption="Temp ~ f(dew point, SLP, month, locale)",
returnData=TRUE
)
# Impact on predictions by month
plotErrorByVar(rfTempvDewSLPMonthLocale,
depVar="TempF",
keyVar="month",
caption="Temp ~ f(dew point, SLP, month, locale)"
)
# Impact on predictions by hour
plotErrorByVar(rfTempvDewSLPMonthLocale,
depVar="TempF",
keyVar="hrfct",
caption="Temp ~ f(dew point, SLP, month, locale)"
)
Month is the most important variable, followed by locale and dew point. Sea-level pressure appears to have less influence. RMSE is down slightly and R-squared is up slightly, with predictions continuing to be hardest in Denver (high RMSE) and San Diego/San Francisco (low r-squared).
There continues to be a pattern where times in 18Z-24Z (roughly late afternoon) are predicted much worse then times in 2-8Z (roughly the first half of nighttime), potentially suggesting that including hour as an explanatory variable could help.
A version of the random forest regression is then run that adds hour and altimeter to the mix. The mtry variable is held at 2:
# Run a regression for TempF vs. six key variables
rfTempvBig6 <- rfRegression(regrData,
depVar="TempF",
predVars=c("DewF", "modSLP", "Altimeter", "month", "locNamefct", "hrfct"),
otherVar=otherVars,
critFilter=list(year=2016, locNamefct=keyLocs3a),
seed=2007191338,
ntree=20,
mtry=2,
testSize=0.3
)
# Evaluate regression using six key variables
rfOut_tvbig6 <- evalRFRegression(rfTempvBig6,
depVar="TempF",
keyVar="locNamefct",
facetVar="locNamefct",
caption="Temp ~ f(dew point, SLP, altimeter, month, locale, hour)",
returnData=TRUE
)
# Impact on predictions by month
plotErrorByVar(rfTempvBig6,
depVar="TempF",
keyVar="month",
caption="Temp ~ f(dew point, SLP, altimeter, month, locale, hour)"
)
# Impact on predictions by hour
plotErrorByVar(rfTempvBig6,
depVar="TempF",
keyVar="hrfct",
caption="Temp ~ f(dew point, SLP, altimeter, month, locale, hour)"
)
While hour and altimeter show as lower on variable importance, overall model performance is significantly improved. Specifically, RMSE is down to ~4.5 and still falling with more trees, while r-squared is up to ~0.93 and still rising with more trees. Notably, many of the locales are now at 90%+ r-squared, with even San Diego and San Francisco now being accurately classified.
Next steps are to focus specifically on some of the tougher locales (Denver, San Francisco, San Diego) and times (20Z for example) to explore what is driving those challenges.
Since locale and month are both known to be important, do the forests run more quickly if they are run for subsets of the data?
regrSubData <- regrData %>%
filter(year==2016, locNamefct %in% keyLocs3a)
locsRegSub <- regrSubData %>%
pull(locNamefct) %>%
as.character() %>%
unique() %>%
sort()
Forests are then run for each locale:
startTime <- Sys.time()
# Create a container to hold the output
lstRegSub <- vector("list", length(locsRegSub))
names(lstRegSub) <- locsRegSub
n <- 1
for (loc in locsRegSub) {
# Run a regression for TempF vs. month, hour, dew point, sea-level pressure, altitude
lstRegSub[[n]] <- rfRegression(regrSubData,
depVar="TempF",
predVars=c("DewF", "modSLP", "Altimeter", "month", "hrfct"),
otherVar=otherVars,
critFilter=list(year=2016, locNamefct=loc),
seed=2007201355+n,
ntree=20,
mtry=2,
testSize=0.3
)
# Incerement counter
n <- n + 1
# Update progress
cat("\nFinished processing:", loc)
}
##
## Finished processing: Atlanta, GA
## Finished processing: Boston, MA
## Finished processing: Chicago, IL
## Finished processing: Dallas, TX
## Finished processing: Denver, CO
## Finished processing: Las Vegas, NV
## Finished processing: Miami, FL
## Finished processing: New Orleans, LA
## Finished processing: Phoenix, AZ
## Finished processing: San Diego, CA
## Finished processing: San Francisco, CA
## Finished processing: Seattle, WA
endTime <- Sys.time()
cat("\nTotal run time:", round(endTime - startTime, 1), "seconds\n")
##
## Total run time: 20.4 seconds
This process runs in around 20 seconds, roughly 100x faster than running the random forest with all the data together. In this case, locale is known to be strongly predictive, so splitting it out may make sense. Does the timing change if month is also split out?
startTime <- Sys.time()
# Create a container to hold the output
lstRegSub_002 <- vector("list", length(locsRegSub)*length(month.abb))
n <- 1
for (loc in locsRegSub) {
for (mon in month.abb) {
# Run a regression for TempF vs. hour, dew point, sea-level pressure, altitude
lstRegSub_002[[n]] <- rfRegression(regrSubData,
depVar="TempF",
predVars=c("DewF", "modSLP", "Altimeter", "hrfct"),
otherVar=otherVars,
critFilter=list(year=2016, locNamefct=loc, month=mon),
seed=2007201405+n,
ntree=20,
mtry=2,
testSize=0.3
)
# Incerement counter
n <- n + 1
}
# Update progress
cat("\nFinished processing:", loc)
}
##
## Finished processing: Atlanta, GA
## Finished processing: Boston, MA
## Finished processing: Chicago, IL
## Finished processing: Dallas, TX
## Finished processing: Denver, CO
## Finished processing: Las Vegas, NV
## Finished processing: Miami, FL
## Finished processing: New Orleans, LA
## Finished processing: Phoenix, AZ
## Finished processing: San Diego, CA
## Finished processing: San Francisco, CA
## Finished processing: Seattle, WA
endTime <- Sys.time()
cat("\nTotal run time:", round(endTime - startTime, 1), "seconds\n")
##
## Total run time: 25.3 seconds
There is no obvious improvement in run time, suggesting that the increased overhead of the for loops may be offsetting any efficiency gains from growing much simpler trees on smaller datasets.
The data from the locale regressions is combined, for purposes of running diagnostics:
# Integrate the test data
subTestData <- map_dfr(lstRegSub, .f=function(x) { x$testData} )
# Integrate the r-squared data
subRSQ <- map_dfr(lstRegSub,
.f=function(x) { tibble::tibble(ntree=1:length(x$rsq), rsq=x$rsq) },
.id="source"
)
# Integrate the RMSE data
subRMSE <- map_dfr(lstRegSub,
.f=function(x) { tibble::tibble(ntree=1:length(x$mse), rmse=x$mse**0.5) },
.id="source"
)
# Plot integrated evolution of RSQ
helperPlotEvolution <- function(df, plotVar) {
p1 <- df %>%
ggplot(aes_string(x="ntree", y=plotVar)) +
geom_line(aes(group=source, color=source)) +
labs(x="Number of trees",
title=paste0("Evolution of ", str_to_upper(plotVar), " by locale")
)
if (plotVar=="rsq") { p1 <- p1 + ylim(c(NA, 1)) }
if (plotVar=="rmse") { p1 <- p1 + ylim(c(0, NA)) }
print(p1)
}
# Plot final R-squared and RMSE by locale
ebvRFSub_loc <- plotErrorByVar(list(testData=subTestData),
depVar="TempF",
keyVar="locNamefct",
returnData=TRUE
)
# Plot final R-squared and RMSE by month
ebvRFSub_mon <- plotErrorByVar(list(testData=subTestData),
depVar="TempF",
keyVar="month",
returnData=TRUE
)
# Plot final R-squared and RMSE by hour
ebvRFSub_hr <- plotErrorByVar(list(testData=subTestData),
depVar="TempF",
keyVar="hrfct",
returnData=TRUE
)
# Show differences by model run
ebvData <- bind_rows(ebvRFSub_loc, rfOut_tvbig6$f3, .id="source") %>%
mutate(source=ifelse(source==1, "Run by locale", "Run overall"))
# Plot differences in R-squared
ebvData %>%
ggplot(aes(x=fct_reorder(locNamefct, rsq), y=rsq)) +
geom_point(aes(color=source)) +
coord_flip() +
ylim(c(NA, 1)) +
labs(x="", y="R-squared", title="R-squared by modeling approach") +
scale_color_discrete("Model")
# Plot differences in RMSE
ebvData %>%
ggplot(aes(x=fct_reorder(locNamefct, rmse), y=rmse)) +
geom_point(aes(color=source)) +
coord_flip() +
ylim(c(0, NA)) +
labs(x="", y="RMSE", title="RMSE by modeling approach") +
scale_color_discrete("Model")
If anything, the models run by locale appear to be slightly more accurate than the models run overall. The differences in variable importance can also be assessed:
# Grab the variable importance for each locale
impLstRegSub <- map_dfr(lstRegSub,
.f=function(x) { x$rfModel$importance %>% t() %>% tibble::as_tibble() },
.id="locNamefct"
)
# Get the percentage by variable, then plot
impLstRegSub %>%
pivot_longer(-locNamefct, names_to="variable", values_to="importance") %>%
group_by(locNamefct) %>%
mutate(pct=importance/sum(importance)) %>%
ungroup() %>%
ggplot(aes(x=locNamefct, y=variable)) +
geom_tile(aes(fill=pct)) +
geom_text(aes(label=round(pct, 3))) +
coord_flip() +
labs(x="", y="", title="Variable importance by locale (scaled to add to 100% in each locale)") +
scale_fill_continuous("Percent", low="white", high="green")
There are some interesting differences in variable importance by locale:
Portions of the difference are driven by seasonal extremes. For example, Chicago has four distinct seasons with dew point correlated to temperature in each. Thus, there is a lot of variance in Chicago temperature data, and much of that variance can be explained by dewpoint. In contrast, Las Vegas tends to have low dew points all year and the marine locales tend to have moderate dew points all year. As such, there is little explanatory power of dew point in these locales.
Another area to examine is the explanatory power for random forest regressions by a mix of two factor variables. A helper function is written to explore this - helperAccuracyByFactors() is available in file WeatherModelingFunctions_v001.R:
helperAccuracyByFactors(subTestData, xVar="locNamefct", yVar="month", depVar="TempF", metric="rsq")
helperAccuracyByFactors(subTestData, xVar="locNamefct", yVar="month", depVar="TempF", metric="rmse")
There are some rather low explanatory powers for some of these intersections. A chunk of the model’s overall explanatory power by locale is in finding differences in average temperature by month, and a few of the R-squared suggest there is not much more being learned about the within month differences. Boston in July stands out - what does the explanatory power by hour look like? Hours are grouped by three to minimize buckets with very small data:
subJuly <- subTestData %>%
filter(month=="Jul") %>%
mutate(hrBucket=factor(3 * (((as.integer(hrfct)-1) %% 24) %/% 3) + 1))
helperAccuracyByFactors(subJuly, xVar="locNamefct", yVar="hrBucket", depVar="TempF", metric="rsq")
There is clearly significant room for improvement, as several of the R-squared values are negative. Suppose that random forest regressions are run for each locale, using only July data:
startTime <- Sys.time()
# Create a container to hold the output
lstRegSub_jul <- vector("list", length(locsRegSub))
n <- 1
for (loc in locsRegSub) {
for (mon in month.abb[7]) {
# Run a regression for TempF vs. hour, dew point, sea-level pressure, altitude
lstRegSub_jul[[n]] <- rfRegression(regrSubData,
depVar="TempF",
predVars=c("DewF", "modSLP", "Altimeter", "hrfct"),
otherVar=otherVars,
critFilter=list(year=2016, locNamefct=loc, month=mon),
seed=2007211342+n,
ntree=100,
mtry=3,
testSize=0.3
)
# Incerement counter
n <- n + 1
}
# Update progress
cat("\nFinished processing:", loc)
}
##
## Finished processing: Atlanta, GA
## Finished processing: Boston, MA
## Finished processing: Chicago, IL
## Finished processing: Dallas, TX
## Finished processing: Denver, CO
## Finished processing: Las Vegas, NV
## Finished processing: Miami, FL
## Finished processing: New Orleans, LA
## Finished processing: Phoenix, AZ
## Finished processing: San Diego, CA
## Finished processing: San Francisco, CA
## Finished processing: Seattle, WA
endTime <- Sys.time()
cat("\nTotal run time:", round(endTime - startTime, 1), "seconds\n")
##
## Total run time: 8.2 seconds
The evaluation process can then be run again:
# Integrate the test data
subTestData_jul <- map_dfr(lstRegSub_jul, .f=function(x) { x$testData} )
# Plot final R-squared and RMSE by locale
ebvRFSub_loc_jul <- plotErrorByVar(list(testData=subTestData_jul),
depVar="TempF",
keyVar="locNamefct",
returnData=TRUE
)
# Plot final R-squared and RMSE by hour
ebvRFSub_hr_jul <- plotErrorByVar(list(testData=subTestData_jul),
depVar="TempF",
keyVar="hrfct",
returnData=TRUE
)
subJuly_002 <- subTestData_jul %>%
filter(month=="Jul") %>%
mutate(hrBucket=factor(3 * (((as.integer(hrfct)-1) %% 24) %/% 3) + 1))
helperAccuracyByFactors(subJuly_002, xVar="locNamefct", yVar="hrBucket", depVar="TempF", metric="rsq")
Boston remains the hardest to predict in July, and there has been only a little imporvement in RMSE or R-squared. Why is the Dallas data still showing negative R-squared?
subPlotJuly <- subJuly_002 %>%
filter(locNamefct=="Dallas, TX", hrBucket==7)
subPlotJuly
## # A tibble: 33 x 12
## TempF source dtime DewF month locNamefct hrfct modSLP
## <dbl> <chr> <dttm> <dbl> <fct> <fct> <fct> <dbl>
## 1 80.1 kdfw_~ 2016-07-01 06:53:00 66.0 Jul Dallas, TX 7 1014.
## 2 82.0 kdfw_~ 2016-07-04 06:53:00 75.0 Jul Dallas, TX 7 1010.
## 3 82.0 kdfw_~ 2016-07-06 07:53:00 72.0 Jul Dallas, TX 8 1011.
## 4 82.9 kdfw_~ 2016-07-11 06:53:00 70.0 Jul Dallas, TX 7 1010.
## 5 84.0 kdfw_~ 2016-07-12 05:53:00 66.9 Jul Dallas, TX 6 1010.
## 6 81.0 kdfw_~ 2016-07-12 07:53:00 70.0 Jul Dallas, TX 8 1010.
## 7 84.9 kdfw_~ 2016-07-13 05:53:00 70.0 Jul Dallas, TX 6 1012.
## 8 84.0 kdfw_~ 2016-07-13 06:53:00 71.1 Jul Dallas, TX 7 1012.
## 9 84.0 kdfw_~ 2016-07-14 07:53:00 71.1 Jul Dallas, TX 8 1013.
## 10 82.0 kdfw_~ 2016-07-16 05:53:00 68 Jul Dallas, TX 6 1014.
## # ... with 23 more rows, and 4 more variables: Altimeter <dbl>,
## # predicted <dbl>, correct <lgl>, hrBucket <fct>
subPlotJuly %>%
ggplot(aes(x=TempF, y=predicted)) +
geom_point() +
geom_abline(lty=2, color="red") +
labs(x="Actual Temperature", y="Predicted Temperature") +
geom_text(data=~filter(., TempF < 78), aes(x=TempF+0.25, label=dtime), hjust=0, color="red") +
coord_equal()
The model appears to be reasonable in this case. There is an anomalously low temperature of 75 degrees at 6Z on July 10. The model is predicts a value that is roughly in-line with what would typically be expected for Dallas in July at midnight. A plot of Dallas Temperatures and Dew Points in July suggests that while low temperatures in the mid-70s are rare, they occur several times:
regrData %>%
filter(source=="kdfw_2016", month=="Jul", !is.na(TempF), !is.na(DewF)) %>%
ggplot(aes(x=dtime, y=TempF)) +
geom_line(color="red") +
geom_line(aes(y=DewF), color="blue") +
labs(x="", y="Temp/Dew (F)", title="Dallas, TX in July 2016")
The dew point does not help predict these low temperature evenings. Suppose 76 degrees is used as the cutoff for a “cold” July occurrence in Dallas. Does anything stand out about these times?
dfw_jul <- regrData %>%
filter(source=="kdfw_2016", month=="Jul", !is.na(TempF), !is.na(DewF)) %>%
mutate(isCold=TempF <= 76.5)
dfw_jul%>%
select(-year, -monthint, -day, -starts_with("cL"), -starts_with("orig"), -hr) %>%
group_by(isCold) %>%
summarize_if(is.numeric, .funs=mean, na.rm=TRUE)
## # A tibble: 2 x 13
## isCold WindSpeed WindGust Visibility Altimeter TempF DewF modSLP p1Inches
## <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 FALSE 10.2 22.3 10.0 30.0 87.7 69.9 1013. 0.0715
## 2 TRUE 7.35 30 8.96 30.0 74.1 67.8 1014. 0.284
## # ... with 4 more variables: p36Inches <dbl>, p24Inches <dbl>, tempFHi <dbl>,
## # tempFLo <dbl>
dfw_jul%>%
select(-year, -monthint, -day, -starts_with("cL"), -starts_with("orig"), -hr) %>%
group_by(isCold) %>%
summarize_if(is.logical, .funs=mean, na.rm=TRUE)
## # A tibble: 2 x 4
## isCold isRain isSnow isThunder
## <lgl> <dbl> <dbl> <dbl>
## 1 FALSE 0.00565 0 0.0141
## 2 TRUE 0.304 0 0.261
dfw_jul%>%
count(isCold, predomDir) %>%
group_by(isCold) %>%
mutate(pct=n/sum(n)) %>%
pivot_wider(predomDir, names_from="isCold", values_from="pct")
## # A tibble: 10 x 3
## predomDir `FALSE` `TRUE`
## <fct> <dbl> <dbl>
## 1 000 0.0141 0.0870
## 2 VRB 0.0141 NA
## 3 NE 0.0155 0.0870
## 4 E 0.0325 0.0435
## 5 SE 0.177 0.304
## 6 S 0.653 0.348
## 7 SW 0.0749 0.0435
## 8 W 0.00847 NA
## 9 NW 0.00282 NA
## 10 N 0.00847 0.0870
dfw_jul%>%
count(isCold, minHeight) %>%
group_by(isCold) %>%
mutate(pct=n/sum(n)) %>%
pivot_wider(minHeight, names_from="isCold", values_from="pct")
## # A tibble: 5 x 3
## minHeight `FALSE` `TRUE`
## <fct> <dbl> <dbl>
## 1 Surface 0.00424 0.217
## 2 Low 0.0918 0.174
## 3 Medium 0.288 0.0870
## 4 High 0.145 0.391
## 5 None 0.470 0.130
dfw_jul%>%
count(isCold, ceilingHeight) %>%
group_by(isCold) %>%
mutate(pct=n/sum(n)) %>%
pivot_wider(ceilingHeight, names_from="isCold", values_from="pct")
## # A tibble: 4 x 3
## ceilingHeight `FALSE` `TRUE`
## <fct> <dbl> <dbl>
## 1 Low 0.0127 NA
## 2 Medium 0.0113 0.0870
## 3 High 0.0297 0.217
## 4 None 0.946 0.696
It appears that a “cold” July evening in Dallas is associated with precipitation, clouds, and winds from the east. Does this match up with data from July 10?
metarData %>%
filter(source=="kdfw_2016", month=="Jul", lubridate::day(dtime)==10, lubridate::hour(dtime) < 12) %>%
pull(origMETAR)
## [1] "KDFW 100053Z 17006KT 10SM -RA FEW035 FEW060 BKN150 BKN250 24/21 A2994 RMK AO2 PRESFR SLP130 OCNL LTGIC DSNT S CB DSNT S MOV SE P0001 T02440211 $"
## [2] "KDFW 100153Z 06007KT 10SM FEW060 SCT110 BKN200 25/22 A2997 RMK AO2 RAE32 SLP141 OCNL LTGIC DSNT S CB DSNT S MOV SE P0000 T02500217 $"
## [3] "KDFW 100253Z 12003KT 10SM FEW060 FEW110 BKN210 25/20 A2994 RMK AO2 SLP130 OCNL LTGIC DSNT S CB DSNT S MOV SE 60001 T02500200 56023 $"
## [4] "KDFW 100353Z 15003KT 10SM FEW120 BKN250 OVC300 24/20 A2996 RMK AO2 SLP136 CB DSNT S MOVD SE T02440200 $"
## [5] "KDFW 100453Z 12005KT 10SM FEW120 SCT250 BKN300 24/20 A2998 RMK AO2 SLP143 T02440200 $"
## [6] "KDFW 100553Z 12006KT 10SM FEW120 SCT250 BKN300 24/20 A2998 RMK AO2 SLP143 60001 T02390200 10256 20239 403500228 51014 $"
## [7] "KDFW 100653Z 18003KT 10SM FEW100 SCT250 SCT300 24/19 A2998 RMK AO2 SLP142 T02440194 $"
## [8] "KDFW 100753Z 18003KT 10SM FEW120 SCT300 23/19 A2996 RMK AO2 SLP136 T02330194 $"
## [9] "KDFW 100853Z 00000KT 10SM FEW250 SCT300 23/20 A2995 RMK AO2 SLP131 T02280200 58012 $"
## [10] "KDFW 100953Z 11004KT 10SM FEW300 23/20 A2995 RMK AO2 SLP134 T02280200 $"
## [11] "KDFW 101053Z 15006KT 10SM FEW110 SCT300 22/20 A2995 RMK AO2 SLP135 T02220200 $"
## [12] "KDFW 101153Z 16004KT 10SM FEW110 FEW300 23/21 A2996 RMK AO2 SLP138 70003 T02280206 10244 20222 53005 $"
To a degree, yes. There were clouds and rain earlier in the evening, though these had cleared out by the time of the anomaly. Potentially, an area for further exploration.
Next, the evolution of RMSE as variables are added is plotted. The main categorical variables - locale, month, and hour - are added first, as these are largely just the average for the intersection of variables:
# Sample test/train dataset
set.seed(2007221340)
nullData <- regrData %>%
filter(!is.na(TempF), year==2016, TempF <= 120, DewF <= 100, TempF >= DewF) %>%
mutate(isTest=runif(nrow(.))<=0.3)
# Null estimates for temperature based on locale; locale-month; and locale-month-hour
nullMeans <- nullData %>%
filter(!isTest) %>%
group_by(locNamefct) %>%
mutate(meanLoc=mean(TempF)) %>%
group_by(locNamefct, month) %>%
mutate(meanLocMonth=mean(TempF)) %>%
group_by(locNamefct, month, hrfct) %>%
summarize(meanLoc=mean(meanLoc), meanLocMonth=mean(meanLocMonth), meanLocMonthHour=mean(TempF), n=n()) %>%
ungroup()
# Assessment of accuracies for temperature by locale
nullTest <- nullData %>%
filter(isTest) %>%
right_join(nullMeans, by=c("locNamefct", "month", "hrfct")) %>%
mutate(eLoc=meanLoc-TempF,
eLocMonth=meanLocMonth-TempF,
eLocMonthHour=meanLocMonthHour-TempF
)
# Create overall metrics
nullOverall <- nullTest %>%
mutate(locNamefct="Overall") %>%
group_by(locNamefct) %>%
summarize(varTot=var(TempF),
eLoc2=mean(eLoc**2),
eLocMonth=mean(eLocMonth**2),
eLocMonthHour=mean(eLocMonthHour**2)
)
# Create locale metrics, and bind overall metrics
nullByLocale <- nullTest %>%
mutate(locNamefct=as.character(locNamefct)) %>%
group_by(locNamefct) %>%
summarize(varTot=var(TempF),
eLoc2=mean(eLoc**2),
eLocMonth=mean(eLocMonth**2),
eLocMonthHour=mean(eLocMonthHour**2)
) %>%
bind_rows(nullOverall)
# Create plot for RMSE
nullByLocale %>%
pivot_longer(-locNamefct, names_to="byNull", values_to="MSE") %>%
mutate(RMSE=MSE**0.5,
byNull=factor(byNull, levels=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour"))
)%>%
group_by(locNamefct) %>%
mutate(dRMSE=ifelse(row_number()==n(), RMSE, RMSE-lead(RMSE))) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(locNamefct, RMSE), y=dRMSE, fill=byNull)) +
geom_col(position="stack") +
coord_flip() +
scale_fill_discrete("Technique",
breaks=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour"),
labels=c("Overall", "Locale", "Locale-Month", "Locale-Month-Hour"),
guide=guide_legend(reverse=TRUE)
) +
labs(x="", y="RMSE", title="RMSE by Locale by Technique (averages only)") +
theme(legend.position="bottom")
For predicting overall accuracy, improvements in RMSE are most rapid when moving from locale-only to a combination of locale-month. This pattern appears to be generally repeated for most locales.
How much expalantory power do dewpoint and sea-level pressure have after controlling for locale-month-hour?
# Addition of dew point to training data
nullTrain <- nullData %>%
filter(!isTest) %>%
right_join(nullMeans, by=c("locNamefct", "month", "hrfct")) %>%
mutate(err=meanLocMonthHour-TempF)
# Simple linear regression for dew point vs. error
lm(err ~ DewF, data=nullTrain) %>%
summary()
##
## Call:
## lm(formula = err ~ DewF, data = nullTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.077 -4.146 0.557 4.296 38.408
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8913389 0.0448067 131.5 <2e-16 ***
## DewF -0.1272657 0.0008986 -141.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.124 on 182948 degrees of freedom
## Multiple R-squared: 0.09881, Adjusted R-squared: 0.09881
## F-statistic: 2.006e+04 on 1 and 182948 DF, p-value: < 2.2e-16
# Simple linear regression for modSLP vs. error
lm(err ~ modSLP, data=nullTrain) %>%
summary()
##
## Call:
## lm(formula = err ~ modSLP, data = nullTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.254 -4.161 -0.217 3.963 43.079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.892e+02 2.543e+00 -153.1 <2e-16 ***
## modSLP 3.829e-01 2.502e-03 153.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.065 on 182948 degrees of freedom
## Multiple R-squared: 0.1135, Adjusted R-squared: 0.1135
## F-statistic: 2.343e+04 on 1 and 182948 DF, p-value: < 2.2e-16
# Simple linear regression for dew point and modSLP vs. error
lm(err ~ DewF + modSLP, data=nullTrain) %>%
summary()
##
## Call:
## lm(formula = err ~ DewF + modSLP, data = nullTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.274 -4.009 0.352 4.092 39.718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.177e+02 2.535e+00 -125.3 <2e-16 ***
## DewF -1.011e-01 8.851e-04 -114.2 <2e-16 ***
## modSLP 3.172e-01 2.485e-03 127.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.826 on 182947 degrees of freedom
## Multiple R-squared: 0.1725, Adjusted R-squared: 0.1725
## F-statistic: 1.907e+04 on 2 and 182947 DF, p-value: < 2.2e-16
At a glance, dew point and sea-level pressure both have meaningful explanatory power. In combination, they explain 17% of the variance in error when predictions are made solely based on locale-month-hour.
Does this change when regressions are run on smaller slices of the data (locale-month)?
nullLM <- nullTrain %>%
nest(data=-c(locNamefct, month)) %>%
mutate(lmResult=map(data, ~lm(err ~ DewF + modSLP, data=.x)),
tidied=map(lmResult, broom::glance)
) %>%
unnest(tidied)
nullLM %>%
ggplot(aes(x=r.squared)) +
geom_histogram(bins=10) +
labs(x="R-squared",
y="# Combinations of Locale-Month",
title="R-squared for error ~ dewpoint + sea-level pressure",
subtitle="Error based on null model of temperature ~ locale + month + hour"
)
nullLM %>%
ggplot(aes(x=fct_reorder(locNamefct, r.squared), y=r.squared)) +
geom_boxplot(fill="lightblue") +
coord_flip() +
labs(x="",
y="R-squared",
title="R-squared for error ~ dewpoint + sea-level pressure (by locale/month)",
subtitle="Error based on null model of temperature ~ locale + month + hour"
)
nullLM %>%
ggplot(aes(x=month, y=r.squared)) +
geom_boxplot(fill="lightblue") +
coord_flip() +
labs(x="",
y="R-squared",
title="R-squared for error ~ dewpoint + sea-level pressure (by locale/month)",
subtitle="Error based on null model of temperature ~ locale + month + hour"
)
Addiing dewpoint and sea-level pressure appears to have better explanatory power in cold-weather locales and in cold-weather months. Explanatory power appears to be lowest in marine and desert locales, and in warm-weather months. In part, this is because there is less overall variation of temperature in these locales, and thus knowing the locale, month, and hour will already drive a very low RMSE. In contrast, RMSE tended to be higher for the cold-weather locales.
The results of the linear regression can then be used to make a new prediction for temperature as follows:
# Merge the lm column to the testing data
nullTestLM <- nullTest %>%
group_by(locNamefct, month) %>%
nest(data=-c(locNamefct, month)) %>%
ungroup() %>%
left_join(select(nullLM, month, locNamefct, lmResult), by=c("month", "locNamefct"))
# Create new predictions in the nested frame
nullTestLM <- nullTestLM %>%
mutate(data=map2(.x=data,
.y=lmResult,
~mutate(.x, errPred=predict(.y, newdata=.x), tempPred=meanLocMonthHour-errPred)
)
)
# Unnest for the actual data
lmNullDewSLP <- nullTestLM %>%
unnest(data)
Calculate and plot the improvements in RMSE:
newOverall <- lmNullDewSLP %>%
mutate(locNamefct="Overall") %>%
group_by(locNamefct) %>%
summarize(eDewSLP=mean((TempF-tempPred)**2))
newByLocale <- lmNullDewSLP %>%
mutate(locNamefct=as.character(locNamefct)) %>%
group_by(locNamefct) %>%
summarize(eDewSLP=mean((TempF-tempPred)**2)) %>%
bind_rows(newOverall) %>%
right_join(nullByLocale, by=c("locNamefct"))
# Create plot for RMSE
newByLocale %>%
select(-eDewSLP, eDewSLP) %>%
pivot_longer(-locNamefct, names_to="byNull", values_to="MSE") %>%
mutate(RMSE=MSE**0.5,
byNull=factor(byNull, levels=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour", "eDewSLP"))
) %>%
group_by(locNamefct) %>%
mutate(dRMSE=ifelse(row_number()==n(), RMSE, RMSE-lead(RMSE))) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(locNamefct, RMSE, .fun=max),
y=dRMSE,
fill=byNull
)
) +
geom_col(position="stack") +
coord_flip() +
scale_fill_discrete("Technique",
breaks=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour", "eDewSLP"),
labels=c("Impact of Locale", "Impact of Month", "Impact of Hour",
"Impact of Dew and SLP", "Final"
),
guide=guide_legend(reverse=TRUE)
) +
labs(x="", y="RMSE", title="RMSE by Locale by Technique (averages, then lm for Dew/SLP)") +
theme(legend.position="bottom")
The final RMSE by locale are much more even than the original RMSE by locale, driven by the significant impact of (especially) month and dewpoint/SLP in many of the more tricky locales. Is there a similar finding when running random forests by locale?
startTime <- Sys.time()
# Create a container to hold the output
locsFull2016 <- newByLocale %>%
filter(locNamefct != "Overall") %>%
pull(locNamefct)
lstFull2016mhxx <- vector("list", length(locsFull2016))
names(lstFull2016mhxx) <- locsFull2016
# Filter so that TempF exists and that bizarre outliers cannot occur
regrUse2016 <- regrData %>%
filter(!is.na(TempF), year==2016, TempF <= 120, DewF <= 100, TempF >= DewF)
n <- 1
for (loc in locsFull2016) {
# Run a regression for TempF vs. month, hour
lstFull2016mhxx[[n]] <- rfRegression(regrUse2016,
depVar="TempF",
predVars=c("month", "hrfct"),
otherVar=otherVars,
critFilter=list(year=2016, locNamefct=loc),
seed=2007231345+n,
ntree=20,
mtry=2,
testSize=0.3
)
# Incerement counter
n <- n + 1
# Update progress
cat("\nFinished processing:", loc)
}
##
## Finished processing: Atlanta, GA
## Finished processing: Boston, MA
## Finished processing: Chicago, IL
## Finished processing: Dallas, TX
## Finished processing: Denver, CO
## Finished processing: Detroit, MI
## Finished processing: Grand Rapids, MI
## Finished processing: Green Bay, WI
## Finished processing: Houston, TX
## Finished processing: Indianapolis, IN
## Finished processing: Las Vegas, NV
## Finished processing: Lincoln, NE
## Finished processing: Los Angeles, CA
## Finished processing: Madison, WI
## Finished processing: Miami, FL
## Finished processing: Milwaukee, WI
## Finished processing: Minneapolis, MN
## Finished processing: New Orleans, LA
## Finished processing: Newark, NJ
## Finished processing: Philadelphia, PA
## Finished processing: Phoenix, AZ
## Finished processing: Saint Louis, MO
## Finished processing: San Antonio, TX
## Finished processing: San Diego, CA
## Finished processing: San Francisco, CA
## Finished processing: San Jose, CA
## Finished processing: Seattle, WA
## Finished processing: Tampa Bay, FL
## Finished processing: Traverse City, MI
## Finished processing: Washington, DC
endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
##
## Total run time: 40.9 seconds
The MSE can then be calculated off each of the testData files, with comparisons made to the null data:
mseFull2016mhxx <- tibble::tibble(locNamefct=names(lstFull2016mhxx),
mse=sapply(lstFull2016mhxx,
FUN=function(x) {
mean((x$testData$TempF-x$testData$predicted)**2)
}
)
)
newByLocale %>%
inner_join(mseFull2016mhxx, by="locNamefct") %>%
mutate(delta=mse**0.5-eLocMonthHour**0.5) %>%
ggplot(aes(x=fct_reorder(locNamefct, -delta), y=delta)) +
geom_col(fill="lightblue") +
coord_flip() +
labs(x="",
y="Change in RMSE (random forest RMSE minus null RMSE)",
title="Impact on RMSE of random forest vs. null model"
)
Broadly speaking, then random forest regression and the null regression achieve very similar RMSE. Some locales are predicted ~0.25 degrees better while others are predicted ~0.25 degrees worse. Does this pattern hold when adding dewpoint and sea-level-pressure to the random forest regression?
startTime <- Sys.time()
# Create a container to hold the output
locsFull2016 <- newByLocale %>%
filter(locNamefct != "Overall") %>%
pull(locNamefct)
lstFull2016mhds <- vector("list", length(locsFull2016))
names(lstFull2016mhds) <- locsFull2016
# Filter so that TempF exists and that bizarre outliers cannot occur
regrUse2016 <- regrData %>%
filter(!is.na(TempF), year==2016, TempF <= 120, DewF <= 100, TempF >= DewF)
n <- 1
for (loc in locsFull2016) {
# Run a regression for TempF vs. month, hour
lstFull2016mhds[[n]] <- rfRegression(regrUse2016,
depVar="TempF",
predVars=c("month", "hrfct", "DewF", "modSLP"),
otherVar=otherVars,
critFilter=list(year=2016, locNamefct=loc),
seed=2007231345+n,
ntree=20,
mtry=3,
testSize=0.3
)
# Incerement counter
n <- n + 1
# Update progress
cat("\nFinished processing:", loc)
}
##
## Finished processing: Atlanta, GA
## Finished processing: Boston, MA
## Finished processing: Chicago, IL
## Finished processing: Dallas, TX
## Finished processing: Denver, CO
## Finished processing: Detroit, MI
## Finished processing: Grand Rapids, MI
## Finished processing: Green Bay, WI
## Finished processing: Houston, TX
## Finished processing: Indianapolis, IN
## Finished processing: Las Vegas, NV
## Finished processing: Lincoln, NE
## Finished processing: Los Angeles, CA
## Finished processing: Madison, WI
## Finished processing: Miami, FL
## Finished processing: Milwaukee, WI
## Finished processing: Minneapolis, MN
## Finished processing: New Orleans, LA
## Finished processing: Newark, NJ
## Finished processing: Philadelphia, PA
## Finished processing: Phoenix, AZ
## Finished processing: Saint Louis, MO
## Finished processing: San Antonio, TX
## Finished processing: San Diego, CA
## Finished processing: San Francisco, CA
## Finished processing: San Jose, CA
## Finished processing: Seattle, WA
## Finished processing: Tampa Bay, FL
## Finished processing: Traverse City, MI
## Finished processing: Washington, DC
endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
##
## Total run time: 113.9 seconds
The MSE can then be calculated off each of the testData files, with comparisons made to the original data that combined null means with a linear model of dewpoint and SLP on the residuals:
mseFull2016mhds <- tibble::tibble(locNamefct=names(lstFull2016mhds),
mse=sapply(lstFull2016mhds,
FUN=function(x) {
mean((x$testData$TempF-x$testData$predicted)**2)
}
)
)
newByLocale %>%
inner_join(mseFull2016mhds, by="locNamefct") %>%
mutate(delta=mse**0.5-eDewSLP**0.5) %>%
ggplot(aes(x=fct_reorder(locNamefct, -delta), y=delta)) +
geom_col(fill="lightblue") +
coord_flip() +
labs(x="",
y="Change in RMSE (random forest RMSE minus null+linear RMSE)",
title="Impact on RMSE of random forest vs. null+linear model"
)
sapply(lstFull2016mhds, FUN=function(x) { x$rsq }) %>%
as.data.frame() %>%
mutate(ntree=1:n()) %>%
tibble::as_tibble() %>%
pivot_longer(-ntree, names_to="locale", values_to="rsq") %>%
group_by(ntree) %>%
summarize(meanrsq=mean(rsq)) %>%
ggplot(aes(x=ntree, y=meanrsq)) +
geom_point() +
geom_text(aes(y=meanrsq+0.005, label=round(meanrsq, 3))) +
ylim(c(NA, 1)) +
labs(x="# Trees", y="Mean R-squared by locale", title="Impact on R-squared of forest size")
The power of the random forest begins to appear, as RMSE is consistently lower by roughly ~0.5 degrees using the random forest regressions rather than the null means with a basic linear model. Importantly, the random forest regression is able to consider hour, dewpoint, and SLP together for a given locale-month, while the null model used a constant impact of dewpoint and SLP for every hour in the locale-month. Further, the random forest regression can potentially tease out non-linear patterns in the data.
Further, the random forest regression model, while already performing better, is continuing to learn slightly even at around 20 trees. This suggests that a larger forest might potentially drive slightly more prediction improvements.
The progression of RMSE by locale is plotted again:
mseFull2016Overall <- map_dfr(lstFull2016mhds, .f=function(x) { x$testData }) %>%
summarize(err2=mean((TempF-predicted)**2)) %>%
pull(err2)
mseFull2016mhds <- mseFull2016mhds %>%
bind_rows(tibble::tibble(locNamefct="Overall", mse=mseFull2016Overall))
mseByLocale <- newByLocale %>%
inner_join(mseFull2016mhds, by=c("locNamefct")) %>%
rename(rf=mse)
# Create plot for RMSE
mseByLocale %>%
select(-eDewSLP, -rf, eDewSLP, rf) %>%
pivot_longer(-locNamefct, names_to="byNull", values_to="MSE") %>%
mutate(RMSE=MSE**0.5,
byNull=factor(byNull, levels=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour", "eDewSLP", "rf"))
) %>%
group_by(locNamefct) %>%
mutate(dRMSE=ifelse(row_number()==n(), RMSE, RMSE-lead(RMSE))) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(locNamefct, RMSE, .fun=max),
y=dRMSE,
fill=byNull
)
) +
geom_col(position="stack") +
coord_flip() +
scale_fill_discrete("Technique",
breaks=c("varTot", "eLoc2", "eLocMonth", "eLocMonthHour", "eDewSLP", "rf"),
labels=c("Impact of Locale", "Impact of Month", "Impact of Hour",
"Impact of Dew and SLP", "Impact of Forest", "Final"
),
guide=guide_legend(reverse=TRUE)
) +
labs(x="", y="RMSE", title="RMSE by Locale by Technique (averages, then lm for Dew/SLP, then forest)") +
theme(legend.position="bottom")
The modest but meaningful impact of moving to the random forest regression is apparent.
A function is written to integrate testData files from multiple regressions, categorize them, and plot the resulting RMSE. Functions helperGetRMSE() and plotRMSEEvolution() are in WeatherModelingFunctions_v001.R. The functions are then run to compare the categorical-only forest and the categorical-plus-numerical forest:
# Impact of adding dewpoint and SLP to the model
rmseCatvNum <- plotRMSEEvolution(list(catOnly=lstFull2016mhxx, catNum=lstFull2016mhds),
depVar="TempF",
mapper=c("catOnly"="Impact of Loc-Month-Hour",
"catNum"="Impact of Dew-SLP"
)
)
This should simplify RMSE comparisons, and the output file can also be used to calculate R-squared:
rmseCatvNum %>%
group_by(locNamefct) %>%
mutate(rsq=1-rmse**2/max(rmse)**2) %>%
ggplot(aes(x=fct_reorder(locNamefct, rsq, .fun=max), y=rsq, color=mainList, group=mainList)) +
geom_point() +
coord_flip() +
labs(x="", y="R-squared", title="Evolution of R-squared by locale") +
scale_color_discrete("",
breaks=c("totVar", "catOnly", "catNum"),
labels=c("No Model", "Categorical Only", "Categorical and Numerical")
) +
ylim(c(0, 1)) +
theme(legend.position="bottom") +
geom_text(data=~filter(., mainList=="catNum"),
aes(y=rsq+0.01, label=round(rsq, 3)),
hjust=0, size=3.5, fontface="bold"
)
Final predictions are less accurate (higher RMSE) in cold-weather locales, but the amount of variance explained (R-squared) is higher in cold weather locales due to the very high starting variance that is significantly reduced by the modeling.
Next, random forest models, run separately by locale, are created for:
startTime <- Sys.time()
# Variable combinations for running random forests
varCombos <- list(cmb1=c("locNamefct"),
cmb2=c("locNamefct", "month"),
cmb3=c("locNamefct", "month", "hrfct"),
cmb4=c("locNamefct", "month", "hrfct", "DewF"),
cmb5=c("locNamefct", "month", "hrfct", "DewF", "modSLP"),
cmb6=c("locNamefct", "month", "hrfct", "DewF", "modSLP", "Altimeter"),
cmb7=c("locNamefct", "month", "hrfct", "DewF", "modSLP", "Altimeter",
"WindSpeed", "predomDir", "minHeight", "ceilingHeight"
)
)
# Variables to be included in all final testData sets
keepVars <- c("source", "dtime", "locNamefct", "month", "hrfct", "DewF",
"modSLP", "Altimeter", "WindSpeed", "predomDir", "minHeight", "ceilingHeight"
)
# Create an overall container to hold the output
lstRFRegCombos <- vector("list", length(varCombos))
# Run through every combination of variables, and run every locale once inside that
nCombo <- 1
for (varCombo in varCombos) {
cat("\nBeginning to process for variables:", varCombo)
# Initialize a list container to be used inside this element of the main list
lstRFRegCombos[[nCombo]] <- vector("list", length(locsFull2016))
# Create mtry based on length(varCombo)
mtry <- min(3, length(varCombo)) + max(0, ceiling((length(varCombo)-4)/3))
cat("\nmtry will be set to:", mtry)
# Create ntree based on length(varCombo)
ntree <- max(5, min(50, ceiling(0.8 * length(varCombo)**2)))
cat("\nntree will be set to:", ntree)
# Run once for each locale
nLoc <- 1
for (loc in locsFull2016) {
# Run a regression for TempF vs. varCombo
lstRFRegCombos[[nCombo]][[nLoc]] <- rfRegression(regrUse2016,
depVar="TempF",
predVars=varCombo,
otherVar=keepVars,
critFilter=list(year=2016, locNamefct=loc),
seed=2007251330,
ntree=ntree,
mtry=mtry,
testSize=0.3
)
# Incerement counter
nLoc <- nLoc + 1
# Announce progress
if ((nLoc %% 5) == 0) cat("\nFinished procesing for:", loc)
}
# Incerement counter
nCombo <- nCombo + 1
# Update progress
cat("\nFinished processing:", varCombo)
}
##
## Beginning to process for variables: locNamefct
## mtry will be set to: 1
## ntree will be set to: 5
## Finished procesing for: Dallas, TX
## Finished procesing for: Houston, TX
## Finished procesing for: Madison, WI
## Finished procesing for: Newark, NJ
## Finished procesing for: San Diego, CA
## Finished procesing for: Traverse City, MI
## Finished processing: locNamefct
## Beginning to process for variables: locNamefct month
## mtry will be set to: 2
## ntree will be set to: 5
## Finished procesing for: Dallas, TX
## Finished procesing for: Houston, TX
## Finished procesing for: Madison, WI
## Finished procesing for: Newark, NJ
## Finished procesing for: San Diego, CA
## Finished procesing for: Traverse City, MI
## Finished processing: locNamefct month
## Beginning to process for variables: locNamefct month hrfct
## mtry will be set to: 3
## ntree will be set to: 8
## Finished procesing for: Dallas, TX
## Finished procesing for: Houston, TX
## Finished procesing for: Madison, WI
## Finished procesing for: Newark, NJ
## Finished procesing for: San Diego, CA
## Finished procesing for: Traverse City, MI
## Finished processing: locNamefct month hrfct
## Beginning to process for variables: locNamefct month hrfct DewF
## mtry will be set to: 3
## ntree will be set to: 13
## Finished procesing for: Dallas, TX
## Finished procesing for: Houston, TX
## Finished procesing for: Madison, WI
## Finished procesing for: Newark, NJ
## Finished procesing for: San Diego, CA
## Finished procesing for: Traverse City, MI
## Finished processing: locNamefct month hrfct DewF
## Beginning to process for variables: locNamefct month hrfct DewF modSLP
## mtry will be set to: 4
## ntree will be set to: 20
## Finished procesing for: Dallas, TX
## Finished procesing for: Houston, TX
## Finished procesing for: Madison, WI
## Finished procesing for: Newark, NJ
## Finished procesing for: San Diego, CA
## Finished procesing for: Traverse City, MI
## Finished processing: locNamefct month hrfct DewF modSLP
## Beginning to process for variables: locNamefct month hrfct DewF modSLP Altimeter
## mtry will be set to: 4
## ntree will be set to: 29
## Finished procesing for: Dallas, TX
## Finished procesing for: Houston, TX
## Finished procesing for: Madison, WI
## Finished procesing for: Newark, NJ
## Finished procesing for: San Diego, CA
## Finished procesing for: Traverse City, MI
## Finished processing: locNamefct month hrfct DewF modSLP Altimeter
## Beginning to process for variables: locNamefct month hrfct DewF modSLP Altimeter WindSpeed predomDir minHeight ceilingHeight
## mtry will be set to: 5
## ntree will be set to: 50
## Finished procesing for: Dallas, TX
## Finished procesing for: Houston, TX
## Finished procesing for: Madison, WI
## Finished procesing for: Newark, NJ
## Finished procesing for: San Diego, CA
## Finished procesing for: Traverse City, MI
## Finished processing: locNamefct month hrfct DewF modSLP Altimeter WindSpeed predomDir minHeight ceilingHeight
endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
##
## Total run time: 607.8 seconds
Results can then be plotted by picking elements of the main list:
# Key list to run through
# Locale (1)
# Locale-Month-Hour (3)
# Locale-Month-Hour-Dew-SLP (5)
# Everything (7)
keyListPlot <- list(locOnly=lstRFRegCombos[[1]],
locmh=lstRFRegCombos[[3]],
locmhds=lstRFRegCombos[[5]],
locevery=lstRFRegCombos[[7]]
)
# Impact of evolving the model
rmseCombos <- plotRMSEEvolution(keyListPlot,
depVar="TempF",
mapper=c("locOnly"="Impact of Locale",
"locmh"="Impact of Month-Hour",
"locmhds"="Impact of Dew-SLP",
"locevery"="Impact of all other"
)
)
# Plot the evolution of R-squared based on the RMSE data
rmseCombos %>%
group_by(locNamefct) %>%
mutate(rsq=1-rmse**2/max(rmse)**2) %>%
ggplot(aes(x=fct_reorder(locNamefct, rsq, .fun=max), y=rsq, color=mainList, group=mainList)) +
geom_point() +
coord_flip() +
labs(x="", y="R-squared", title="Evolution of R-squared by locale") +
scale_color_discrete("",
breaks=c("totVar", "locOnly", "locmh", "locmhds", "locevery"),
labels=c("No Model", "Locale Only", "Locale-Month-Hour",
"Locale-Month-Hour-Dew-SLP", "Everything"
)
) +
ylim(c(0, 1)) +
theme(legend.position="bottom") +
geom_text(data=~filter(., mainList=="locevery"),
aes(y=rsq+0.01, label=round(rsq, 3)),
hjust=0, size=3.5, fontface="bold"
)
R-squared is generally in the 95%+ regime, with the exceptions of the Pacific coast, deserts, Denver, and Florida. Denver and the deserts stand out as interesting since they also have among the highest remaining RMSE. In contrast, the low R-squared for the Pacific coast and Florida is based in part on the low underlying temperature variations for these locales.
Next, the models are used to make predictions on data from the same locales but different years:
# Create the list of locales that have data in multiple years
oosLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")
# Pull a dataset that has relevant data
oosData <- regrData %>%
filter(locNamefct %in% oosLocs,
!is.na(TempF)
)
# Create a container to hold results by locale
oosList <- vector("list", length(oosLocs))
names(oosList) <- oosLocs
# Run for each locale
nLoc <- 1
for (loc in oosLocs) {
# Filter to only data for this locale
oosHere <- oosData %>%
filter(locNamefct==loc)
# Get the position for the locale variable
locPosn <- match(loc, locsFull2016)
cat("\nMaking predictions for:", loc, "which is in position:", locPosn)
# Make an appropriate sized list for holding all the predictions
preds <- vector("list", length(lstRFRegCombos))
# Make predictions for each model
nModel <- 1
for (mdl in lstRFRegCombos) {
depVars <- mdl[[locPosn]]$rfModel$importance %>% row.names()
preds[[nModel]] <- predict(mdl[[locPosn]]$rfModel, newdata=oosHere[, depVars])
nModel <- nModel + 1
}
# Output the model data file with predictions
oosList[[nLoc]] <- oosHere %>%
bind_cols(map_dfc(preds, .f=function(x) x))
nLoc <- nLoc + 1
}
##
## Making predictions for: Chicago, IL which is in position: 3
## Making predictions for: Las Vegas, NV which is in position: 11
## Making predictions for: New Orleans, LA which is in position: 18
## Making predictions for: San Diego, CA which is in position: 24
Performance by year and model can be assessed:
# Hard code for the number of models
nModels <- 7
# Combine the datasets in oosList
tblPreds <- map_dfr(oosList, .f=function(x) x, .id="listName")
# Create the error for each scenario
for (intCtr in 1:nModels) {
tblPreds <- tblPreds %>%
mutate(err=get(paste0("V", intCtr))-TempF) %>%
rename_at("err", ~paste0("err", intCtr))
}
# Create RMSE by locale and year
oosRMSE <- tblPreds %>%
group_by(locNamefct, year) %>%
summarize_at(vars(starts_with("err")), .f=function(x) mean(x**2)**0.5) %>%
pivot_longer(-c(locNamefct, year), names_to="model", values_to="rmse") %>%
filter(model %in% c("err1", "err3", "err5", "err7")) %>%
mutate(model=case_when(model=="err1" ~ "loc only", model=="err3" ~ "add mon/hr",
model=="err5" ~ "add dew/SLP", model=="err7" ~ "add all other",
TRUE~model
)
)
# Create plot of RMSE by locale by year
oosRMSE %>%
ggplot(aes(x=fct_reorder(model, -rmse), y=rmse, fill=factor(year))) +
geom_col(position="dodge") +
facet_wrap(~locNamefct) +
labs(title="RMSE evolution for models built using 2016 data", x="", y="RMSE") +
scale_fill_discrete("")
# Create base R-squared by locale and year
baseRMSE <- oosData %>%
group_by(locNamefct, year) %>%
summarize(basermse=sd(TempF))
# Combine with base RMSE data
oosRMSE %>%
left_join(baseRMSE, by=c("locNamefct", "year")) %>%
mutate(rsq=1-rmse**2/basermse**2) %>%
ggplot(aes(x=factor(year), y=rsq, fill=fct_reorder(model, rsq))) +
geom_col(position="dodge") +
facet_wrap(~locNamefct) +
labs(title="R-squared evolution for models built using 2016 data", x="", y="R-squared") +
scale_fill_discrete("") +
coord_flip()
The 2016 models perform better on the 2016 data, though there continues to be reduction in RMSE with more complex models in every year (with the exception of Las Vegas). This suggests the model is partially learning features specific to the year 2016 and partially learning features specific to the locale. Creating models based on a larger sample of data (more years) might help generalize the findings.
Next, models are run for the 2014-2019 data for the locales that have data availability (Chicago, IL; Las Vegas, NV; New Orleans, LA; San Diego, CA):
# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")
# Create a main list, one per locale
lstFullData <- vector("list", length(fullDataLocs))
# Create a list of relevant dependent variables and variables to keep
depVarFull <- c('hrfct', 'DewF', 'modSLP', 'Altimeter', 'WindSpeed',
'predomDir', 'minHeight', 'ceilingHeight'
)
keepVarFull <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct', 'DewF', 'modSLP',
'Altimeter', 'WindSpeed', 'predomDir', 'minHeight', 'ceilingHeight'
)
# Run the regressions by locale and month
nLoc <- 1
for (loc in fullDataLocs) {
# Pull data for only this locale, and where TempF is not missing
pullData <- regrData %>%
filter(locNamefct==loc, !is.na(TempF))
# Create the months to be run
fullDataMonths <- pullData %>%
count(month) %>%
pull(month)
# Create containers for each run
lstFullData[[nLoc]] <- vector("list", length(fullDataMonths))
# Run random forest regression for each month for the locale
cat("\nBeginning to process:", loc)
nMonth <- 1
for (mon in fullDataMonths) {
# Run the regression
lstFullData[[nLoc]][[nMonth]] <- rfRegression(pullData,
depVar="TempF",
predVars=depVarFull,
otherVar=keepVarFull,
critFilter=list(locNamefct=loc, month=mon),
seed=2007271252,
ntree=100,
mtry=4,
testSize=0.3
)
# Increment the counter
nMonth <- nMonth + 1
cat("\nFinished month:", mon)
}
# Incerement the counter
nLoc <- nLoc + 1
}
##
## Beginning to process: Chicago, IL
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec
## Beginning to process: Las Vegas, NV
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec
## Beginning to process: New Orleans, LA
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec
## Beginning to process: San Diego, CA
## Finished month: Jan
## Finished month: Feb
## Finished month: Mar
## Finished month: Apr
## Finished month: May
## Finished month: Jun
## Finished month: Jul
## Finished month: Aug
## Finished month: Sep
## Finished month: Oct
## Finished month: Nov
## Finished month: Dec
The relevant ‘testData’ files can then be combined for an assessment of overall prediction accuracy:
# Helper function to extract testData from inner list
combineTestData <- function(lst, elem="testData") {
map_dfr(lst, .f=function(x) x[[elem]])
}
# Combine all of the test data files
fullTestData <- map_dfr(lstFullData, .f=combineTestData) %>%
mutate(err=predicted-TempF,
year=factor(year)
)
# Helper function to create RMSE data
helperCreateRMSE <- function(df, byVar, depVar, errVar="err") {
df %>%
group_by_at(vars(all_of(byVar))) %>%
summarize(varTot=var(get(depVar)), varModel=mean(get(errVar)**2)) %>%
mutate(rmseTot=varTot**0.5, rmseModel=varModel**0.5, rsq=1-varModel/varTot)
}
# Create plot for a given by-variable and facet-variable
helperRMSEPlot <- function(df, byVar, depVar, facetVar=NULL) {
# Create a copy of the original by variable
byVarOrig <- byVar
# Expand byVar to include facetVar if facetVar is not null
if (!is.null(facetVar)) {
byVar <- unique(c(byVar, facetVar))
}
# Create
p1 <- df %>%
helperCreateRMSE(byVar=byVar, depVar=depVar) %>%
select_at(vars(all_of(c(byVar, "rmseTot", "rmseModel")))) %>%
pivot_longer(c(rmseTot, rmseModel), names_to="model", values_to="rmse") %>%
group_by_at(vars(all_of(byVar))) %>%
mutate(dRMSE=ifelse(row_number()==n(), rmse, rmse-lead(rmse)),
model=factor(model, levels=c("rmseTot", "rmseModel"))
) %>%
ggplot(aes_string(x=byVarOrig, y="dRMSE", fill="model")) +
geom_col() +
geom_text(data=~filter(., model=="rmseModel"), aes(y=dRMSE/2, label=round(dRMSE, 1))) +
coord_flip() +
labs(x="", y="RMSE", title="RMSE before and after modelling") +
scale_fill_discrete("",
breaks=c("rmseModel", "rmseTot"),
labels=c("Final", "Explained by Model")
) +
theme(legend.position="bottom")
# Add facetting if the argument was passed
if (!is.null(facetVar)) { p1 <- p1 + facet_wrap(as.formula(paste("~", facetVar))) }
print(p1)
}
# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData, byVar="locNamefct", depVar="TempF")
helperRMSEPlot(fullTestData, byVar="year", depVar="TempF")
helperRMSEPlot(fullTestData, byVar="month", depVar="TempF")
# Facetted by locale
helperRMSEPlot(fullTestData, byVar="year", depVar="TempF", facetVar="locNamefct")
helperRMSEPlot(fullTestData, byVar="month", depVar="TempF", facetVar="locNamefct")
Further, an overall decline in MSE can be estimated as the average of the MSE declines in each locale-month:
# Function to extract MSE data from inner lists
helperMSETibble <- function(x) {
map_dfr(x, .f=function(y) tibble::tibble(ntree=1:length(y$mse), mse=y$mse))
}
map_dfr(lstFullData, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
group_by(source, ntree) %>%
summarize(meanmse=mean(mse)) %>%
ungroup() %>%
mutate(source=fullDataLocs[as.integer(source)]) %>%
ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) +
geom_line() +
ylim(c(0, NA)) +
labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")
At 100 trees, the model appears to have largely completed learning, with no more material declines in MSE. Overall, model predictions average 3-4 degrees different from actual temperatures. Deviations are greater in Las Vegas (4-5 degrees), and in the spring in Chicago (4-5 degrees). Deviations are lesser in San Diego (2-3 degrees) and winter in Chicago (2-3 degrees).
The model is then run for all months combined for a single locale, to compare results when month is a trained explanatory variable rather than a segment modelled separately:
# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")
# Create a main list, one per locale
lstFullData_002 <- vector("list", length(fullDataLocs))
# Create a list of relevant dependent variables and variables to keep
depVarFull_002 <- c('month', 'hrfct', 'DewF', 'modSLP',
'Altimeter', 'WindSpeed', 'predomDir',
'minHeight', 'ceilingHeight'
)
keepVarFull_002 <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct',
'DewF', 'modSLP', 'Altimeter', 'WindSpeed', 'predomDir',
'minHeight', 'ceilingHeight'
)
# Run the regressions by locale and month
nLoc <- 1
for (loc in fullDataLocs) {
# Pull data for only this locale, and where TempF is not missing
pullData <- regrData %>%
filter(locNamefct==loc, !is.na(TempF))
# To be parallel with previous runs, make a length-one list inside locale
lstFullData_002[[nLoc]] <- vector("list", 1)
# Run random forest regression for each locale
cat("\nBeginning to process:", loc)
lstFullData_002[[nLoc]][[1]] <- rfRegression(pullData,
depVar="TempF",
predVars=depVarFull_002,
otherVar=keepVarFull_002,
critFilter=list(locNamefct=loc),
seed=2007281307,
ntree=25,
mtry=4,
testSize=0.3
)
# Incerement the counter
nLoc <- nLoc + 1
}
##
## Beginning to process: Chicago, IL
## Beginning to process: Las Vegas, NV
## Beginning to process: New Orleans, LA
## Beginning to process: San Diego, CA
The results can then be compared to the results of the regressions run using month as a segment:
# Combine all of the test data files
fullTestData_002 <- map_dfr(lstFullData_002, .f=combineTestData) %>%
mutate(err=predicted-TempF,
year=factor(year)
)
# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData_002, byVar="locNamefct", depVar="TempF")
helperRMSEPlot(fullTestData_002, byVar="year", depVar="TempF")
helperRMSEPlot(fullTestData_002, byVar="month", depVar="TempF")
# Facetted by locale
helperRMSEPlot(fullTestData_002, byVar="year", depVar="TempF", facetVar="locNamefct")
helperRMSEPlot(fullTestData_002, byVar="month", depVar="TempF", facetVar="locNamefct")
# Evolution of RMSE
map_dfr(lstFullData_002, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
group_by(source, ntree) %>%
summarize(meanmse=mean(mse)) %>%
ungroup() %>%
mutate(source=fullDataLocs[as.integer(source)]) %>%
ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) +
geom_line() +
ylim(c(0, NA)) +
labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")
The prediction qualities and evolution of MSE by number of trees look broadly similar to the results run by locale-month. Notably, month scores high on variable importance:
impList <- lapply(lstFullData_002, FUN=function(x) {
locName <- x[[1]]$testData$locNamefct %>% as.character() %>% unique()
x[[1]]$rfModel$importance %>%
as.data.frame() %>%
rownames_to_column("variable") %>%
rename_at(vars(all_of("IncNodePurity")), ~locName) %>%
tibble::as_tibble()
}
)
impDF <- Reduce(function(x, y) merge(x, y, all=TRUE), impList)
# Overall variable importance
impDF %>%
pivot_longer(-variable, names_to="locale", values_to="incPurity") %>%
ggplot(aes(x=fct_reorder(varMapper[variable], incPurity), y=incPurity)) +
geom_col() +
coord_flip() +
facet_wrap(~locale) +
labs(x="", y="Importance", title="Variable Importance by Locale")
# Relative variable importance
impDF %>%
pivot_longer(-variable, names_to="locale", values_to="incPurity") %>%
group_by(locale) %>%
mutate(incPurity=incPurity/sum(incPurity)) %>%
ggplot(aes(x=fct_reorder(varMapper[variable], incPurity), y=incPurity)) +
geom_col() +
coord_flip() +
facet_wrap(~locale) +
labs(x="", y="Relative Importance", title="Relative Variable Importance by Locale")
There is much more underlying variance in the Chicago data, thus greater overall variable importance in Chicago. On a relative basis, locale predictions are driven by:
It is interesting to see the similarities in Chicago and New Orleans, with both having strong explanatory power from the combination of dew point and month, despite meaningfully different climates. As in previous analyses, Las Vegas and San Diego look different from each other and also different from Chicago/New Orleans.
Finally, an attempt is made to model temperature using locale as an explanatory variable:
# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")
# Create a main list, one for overall
lstFullData_003 <- vector("list", 1)
# Create a list of relevant dependent variables and variables to keep
depVarFull_003 <- c('locNamefct', 'month', 'hrfct',
'DewF', 'modSLP', 'Altimeter',
'WindSpeed', 'predomDir', 'minHeight', 'ceilingHeight'
)
keepVarFull_003 <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct',
'DewF', 'modSLP', 'Altimeter', 'WindSpeed', 'predomDir',
'minHeight', 'ceilingHeight'
)
# Run the overall regressions
# Pull data for only the specified locales, and where TempF is not missing
pullData <- regrData %>%
filter(locNamefct %in% fullDataLocs, !is.na(TempF))
# To be parallel with previous runs, make a length-one list inside locale
lstFullData_003[[1]] <- vector("list", 1)
startTime <- Sys.time()
# Run random forest regression (overall only)
lstFullData_003[[1]][[1]] <- rfRegression(pullData,
depVar="TempF",
predVars=depVarFull_003,
otherVar=keepVarFull_003,
critFilter=list(locNamefct=fullDataLocs),
seed=2007291317,
ntree=2,
mtry=4,
testSize=0.3
)
endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
##
## Total run time: 1121.1 seconds
The results can then be compared to the results of the regressions run using locale as a segment:
# Combine all of the test data files
fullTestData_003 <- map_dfr(lstFullData_003, .f=combineTestData) %>%
mutate(err=predicted-TempF,
year=factor(year)
)
# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData_003, byVar="locNamefct", depVar="TempF")
helperRMSEPlot(fullTestData_003, byVar="year", depVar="TempF")
helperRMSEPlot(fullTestData_003, byVar="month", depVar="TempF")
# Facetted by locale
helperRMSEPlot(fullTestData_003, byVar="year", depVar="TempF", facetVar="locNamefct")
helperRMSEPlot(fullTestData_003, byVar="month", depVar="TempF", facetVar="locNamefct")
# Evolution of RMSE
map_dfr(lstFullData_003, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
group_by(source, ntree) %>%
summarize(meanmse=mean(mse)) %>%
ungroup() %>%
mutate(source="Overall") %>%
ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) +
geom_line() +
ylim(c(0, NA)) +
labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")
Run time is very significantly longer when using locale as an explanatory variable rather than running separate models by locale. Splitting the models by locale and/or month may create some complexities for summary statistics and some risks of over-fitting. But, the risks may be more than offset by the much lower run times.
What if the model is run on a very small subset of the data (e.g., switch test data from 30% of the sample to 90% of the sample)?
# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")
# Create a main list, one for overall
lstFullData_004 <- vector("list", 1)
# Create a list of relevant dependent variables and variables to keep
depVarFull_004 <- c('locNamefct', 'month', 'hrfct',
'DewF', 'modSLP', 'Altimeter',
'WindSpeed', 'predomDir', 'minHeight', 'ceilingHeight'
)
keepVarFull_004 <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct',
'DewF', 'modSLP', 'Altimeter', 'WindSpeed', 'predomDir',
'minHeight', 'ceilingHeight'
)
# Run the overall regressions
# Pull data for only the specified locales, and where TempF is not missing
pullData <- regrData %>%
filter(locNamefct %in% fullDataLocs, !is.na(TempF))
# To be parallel with previous runs, make a length-one list inside locale
lstFullData_004[[1]] <- vector("list", 1)
startTime <- Sys.time()
# Run random forest regression (overall only)
lstFullData_004[[1]][[1]] <- rfRegression(pullData,
depVar="TempF",
predVars=depVarFull_004,
otherVar=keepVarFull_004,
critFilter=list(locNamefct=fullDataLocs),
seed=2007291412,
ntree=100,
mtry=4,
testSize=0.9
)
endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
##
## Total run time: 338 seconds
The results can then be compared to the results of the regressions run using locale as a segment:
# Combine all of the test data files
fullTestData_004 <- map_dfr(lstFullData_004, .f=combineTestData) %>%
mutate(err=predicted-TempF,
year=factor(year)
)
# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData_004, byVar="locNamefct", depVar="TempF")
helperRMSEPlot(fullTestData_004, byVar="year", depVar="TempF")
helperRMSEPlot(fullTestData_004, byVar="month", depVar="TempF")
# Facetted by locale
helperRMSEPlot(fullTestData_004, byVar="year", depVar="TempF", facetVar="locNamefct")
helperRMSEPlot(fullTestData_004, byVar="month", depVar="TempF", facetVar="locNamefct")
# Evolution of RMSE
map_dfr(lstFullData_004, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
group_by(source, ntree) %>%
summarize(meanmse=mean(mse)) %>%
ungroup() %>%
mutate(source="Overall") %>%
ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) +
geom_line() +
ylim(c(0, NA)) +
labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")
Run time is dramatically faster, with 100 trees on the 10% sample growing in less time than needed for 2 trees to be grown on 70% of the sample. The size of the dataset may be driving run-time challenges more than the complexity of the model.
There is evidence that the smaller training dataset is hurting model predictions. In contrast to previous models using 70% of the data, the model using only 10% of the data appears to be 1 degree less accurate in making predictions on unseen data.
The test data size is decreased to 80% to check the impact:
# Create list of locations
fullDataLocs <- c("Chicago, IL", "Las Vegas, NV", "New Orleans, LA", "San Diego, CA")
# Create a main list, one for overall
lstFullData_005 <- vector("list", 1)
# Create a list of relevant dependent variables and variables to keep
depVarFull_005 <- c('locNamefct', 'month', 'hrfct',
'DewF', 'modSLP', 'Altimeter',
'WindSpeed', 'predomDir', 'minHeight', 'ceilingHeight'
)
keepVarFull_005 <- c('source', 'dtime', 'locNamefct', 'year', 'month', 'hrfct',
'DewF', 'modSLP', 'Altimeter', 'WindSpeed', 'predomDir',
'minHeight', 'ceilingHeight'
)
# Run the overall regressions
# Pull data for only the specified locales, and where TempF is not missing
pullData <- regrData %>%
filter(locNamefct %in% fullDataLocs, !is.na(TempF))
# To be parallel with previous runs, make a length-one list inside locale
lstFullData_005[[1]] <- vector("list", 1)
startTime <- Sys.time()
# Run random forest regression (overall only)
lstFullData_005[[1]][[1]] <- rfRegression(pullData,
depVar="TempF",
predVars=depVarFull_005,
otherVar=keepVarFull_005,
critFilter=list(locNamefct=fullDataLocs),
seed=2007291425,
ntree=25,
mtry=4,
testSize=0.8
)
endTime <- Sys.time()
cat("\nTotal run time:", round(as.numeric(difftime(endTime, startTime, units="secs")), 1), "seconds\n")
##
## Total run time: 925.8 seconds
The results can then be compared to the results of the regressions run using locale as a segment:
# Combine all of the test data files
fullTestData_005 <- map_dfr(lstFullData_005, .f=combineTestData) %>%
mutate(err=predicted-TempF,
year=factor(year)
)
# Stand-alone on three main dimensions
helperRMSEPlot(fullTestData_005, byVar="locNamefct", depVar="TempF")
helperRMSEPlot(fullTestData_005, byVar="year", depVar="TempF")
helperRMSEPlot(fullTestData_005, byVar="month", depVar="TempF")
# Facetted by locale
helperRMSEPlot(fullTestData_005, byVar="year", depVar="TempF", facetVar="locNamefct")
helperRMSEPlot(fullTestData_005, byVar="month", depVar="TempF", facetVar="locNamefct")
# Evolution of RMSE
map_dfr(lstFullData_005, .f=function(x) { helperMSETibble(x) }, .id="source") %>%
group_by(source, ntree) %>%
summarize(meanmse=mean(mse)) %>%
ungroup() %>%
mutate(source="Overall") %>%
ggplot(aes(x=ntree, y=meanmse, group=source, color=source)) +
geom_line() +
ylim(c(0, NA)) +
labs(x="# Trees", y="MSE", title="Evolution of Average MSE by Number of Trees")
This model splits the differences, being about 0.5 degrees worse than the 70% train data model run by locale and about 0.5 degrees better than the 10% train data model run overall. This supports the blend of art and science necessary for successfully running random forest models.
Next, some simulated data is prepared for investigating the impact of random forest hyperparameters on the quality of the output. In this case, a simple formula will be used where y = x1*x2 + x3. Variables x1 and x2 will be from a Poisson distribution while variable x3 will be from a normal distribution. Variables x4 (normal, correlated to -x1), x5 (random normal variable), and x6 (random factor variable) are also added:
nObs <- 5000
set.seed(2007301252)
rfRnd <- tibble::tibble(x1=rpois(nObs, lambda=4),
x2=rpois(nObs, lambda=6),
x3=rnorm(nObs),
x4=rnorm(nObs, mean=-x1),
x5=rnorm(nObs, sd=3),
x6=factor(sample(letters[1:3], nObs, replace=TRUE), levels=letters[1:3]),
y=x1*x2 + x3
)
# Summary statistics
summary(rfRnd)
## x1 x2 x3 x4
## Min. : 0.000 Min. : 0.000 Min. :-3.361358 Min. :-14.067
## 1st Qu.: 3.000 1st Qu.: 4.000 1st Qu.:-0.662320 1st Qu.: -5.441
## Median : 4.000 Median : 6.000 Median :-0.016804 Median : -3.875
## Mean : 3.976 Mean : 5.975 Mean :-0.006134 Mean : -3.985
## 3rd Qu.: 5.000 3rd Qu.: 7.000 3rd Qu.: 0.648586 3rd Qu.: -2.377
## Max. :13.000 Max. :17.000 Max. : 3.578669 Max. : 2.197
## x5 x6 y
## Min. :-10.33459 a:1709 Min. : -2.366
## 1st Qu.: -2.05727 b:1719 1st Qu.: 11.772
## Median : -0.01076 c:1572 Median : 20.357
## Mean : -0.03043 Mean : 23.770
## 3rd Qu.: 1.95468 3rd Qu.: 31.888
## Max. : 11.66095 Max. :117.253
# Correlations
rfRnd %>%
select(-x6) %>%
cor() %>%
round(2)
## x1 x2 x3 x4 x5 y
## x1 1.00 0.00 0.01 -0.90 0.01 0.74
## x2 0.00 1.00 0.00 0.00 -0.02 0.60
## x3 0.01 0.00 1.00 0.00 -0.02 0.07
## x4 -0.90 0.00 0.00 1.00 -0.02 -0.66
## x5 0.01 -0.02 -0.02 -0.02 1.00 -0.01
## y 0.74 0.60 0.07 -0.66 -0.01 1.00
# Plot of y vs x1/x2
rfRnd %>%
group_by(x1, x2) %>%
summarize(n=n(), y=mean(y)) %>%
ggplot(aes(x=x1, y=x2)) +
geom_tile(aes(fill=y)) +
geom_text(aes(label=paste0("n=", n))) +
scale_fill_continuous(low="white", high="blue") +
labs(title="Randomly generated data where y=x1*x2 + x3")
The simulated data appears to be as expected. Train and test datasets are created using a 70/30 split:
trnIdx <- sample(1:nrow(rfRnd), replace=FALSE, size=round(0.7*nrow(rfRnd)))
rfRndTrain <- rfRnd[trnIdx, ]
rfRndTest <- rfRnd[-trnIdx, ]
First, a linear model is run using the ‘known’ formula:
lmRnd <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x1:x2, data=rfRndTrain)
summary(lmRnd)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x1:x2, data = rfRndTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.568e-12 -8.600e-15 -2.300e-15 1.800e-15 1.193e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.091e-13 2.306e-14 4.729e+00 2.34e-06 ***
## x1 -1.095e-14 6.272e-15 -1.746e+00 0.081 .
## x2 3.504e-14 3.474e-15 1.009e+01 < 2e-16 ***
## x3 1.000e+00 3.740e-15 2.674e+14 < 2e-16 ***
## x4 6.315e-16 3.756e-15 1.680e-01 0.866
## x5 -9.869e-16 1.236e-15 -7.980e-01 0.425
## x6b 9.086e-16 8.958e-15 1.010e-01 0.919
## x6c 1.148e-14 9.190e-15 1.250e+00 0.212
## x1:x2 1.000e+00 7.841e-16 1.275e+15 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.2e-13 on 3491 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.364e+30 on 8 and 3491 DF, p-value: < 2.2e-16
rfRndOut <- rfRndTest %>%
mutate(predLM=predict(lmRnd, newdata=.),
errLM=predLM-y
)
# Show RMSE for rfRndOut
rfRndOut %>%
group_by(x6) %>%
summarize(baseVar=var(y), lmVar=mean(errLM**2)) %>%
mutate(lmR2=1-lmVar/baseVar)
## # A tibble: 3 x 4
## x6 baseVar lmVar lmR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 4.70e-26 1
## 2 b 216. 4.71e-26 1
## 3 c 246. 5.33e-26 1
As expected, the linear model using the known formula has 100% explanatory power on both the test and the training datasets.
Next, a random forest regression is run using all the default settings:
# Run the random forest model
rfRndModel <- randomForest::randomForest(y=rfRndTrain$y,
x=rfRndTrain[, c("x1", "x2", "x3", "x4", "x5", "x6")]
)
# Assess accuracy on training dataset
rfRndModel
##
## Call:
## randomForest(x = rfRndTrain[, c("x1", "x2", "x3", "x4", "x5", "x6")], y = rfRndTrain$y)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 5.701355
## % Var explained: 97.82
# Plot evolution of R-squared
tibble::tibble(rsq=rfRndModel$rsq, ntree=1:length(rsq)) %>%
ggplot(aes(x=ntree, y=rsq)) +
geom_point() +
ylim(c(NA, 1)) +
labs(x="# trees", y="R-squared", title="Default parameters, all variables included")
# Plot variable importance
rfRndModel$importance %>%
as.data.frame() %>%
rownames_to_column("variable") %>%
ggplot(aes(x=fct_reorder(variable, IncNodePurity), y=IncNodePurity)) +
geom_col(fill="lightblue") +
coord_flip() +
labs(x="", title="Default parameters, all variables included")
# Assess performance on training data
rfRndOut <- rfRndOut %>%
mutate(predRF=predict(rfRndModel, newdata=.),
errRF=predRF-y
)
# Show RMSE for rfRndOut
rfRndOut %>%
group_by(x6) %>%
summarize(baseVar=var(y), rfVar=mean(errRF**2)) %>%
mutate(rfR2=1-rfVar/baseVar)
## # A tibble: 3 x 4
## x6 baseVar rfVar rfR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 7.44 0.977
## 2 b 216. 3.20 0.985
## 3 c 246. 6.10 0.975
The random forest performs well, though a few limitations stand out:
Part of the issue is that there are 3 unnecessary variables while mtry is set to 2. So, there are meaningful splits where no relevant variable is included, and meaningful splits where the negative correlation random variable is used in place of the positive correlation variable due to availability.
Suppose mtry is increased to 4:
# Run the random forest model
rfRndModel <- randomForest::randomForest(y=rfRndTrain$y,
x=rfRndTrain[, c("x1", "x2", "x3", "x4", "x5", "x6")],
mtry=4
)
# Assess accuracy on training dataset
rfRndModel
##
## Call:
## randomForest(x = rfRndTrain[, c("x1", "x2", "x3", "x4", "x5", "x6")], y = rfRndTrain$y, mtry = 4)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 1.294699
## % Var explained: 99.51
# Plot evolution of R-squared
tibble::tibble(rsq=rfRndModel$rsq, ntree=1:length(rsq)) %>%
ggplot(aes(x=ntree, y=rsq)) +
geom_point() +
ylim(c(NA, 1)) +
labs(x="# trees", y="R-squared", title="mtry=4, all variables included")
# Plot variable importance
rfRndModel$importance %>%
as.data.frame() %>%
rownames_to_column("variable") %>%
ggplot(aes(x=fct_reorder(variable, IncNodePurity), y=IncNodePurity)) +
geom_col(fill="lightblue") +
coord_flip() +
labs(x="", title="mtry=4, all variables included")
# Assess performance on training data
rfRndOut <- rfRndOut %>%
mutate(predRF=predict(rfRndModel, newdata=.),
errRF=predRF-y
)
# Show RMSE for rfRndOut
rfRndOut %>%
group_by(x6) %>%
summarize(baseVar=var(y), rfVar=mean(errRF**2)) %>%
mutate(rfR2=1-rfVar/baseVar)
## # A tibble: 3 x 4
## x6 baseVar rfVar rfR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 2.21 0.993
## 2 b 216. 0.450 0.998
## 3 c 246. 1.55 0.994
R-squared improves significantly yo ~99.5%. The model correctly puts no importance on x5 and x6 but still puts a high importance on x4 which is known to be unnecessary.
Suppose that the random forest trains only on the three variables known to be important, and is allowed to use all features at every split while drilling down to a nodesize of 1:
# Run the random forest model
rfRndModel <- randomForest::randomForest(y=rfRndTrain$y,
x=rfRndTrain[, c("x1", "x2", "x3")],
mtry=3,
nodesize=1
)
# Assess accuracy on training dataset
rfRndModel
##
## Call:
## randomForest(x = rfRndTrain[, c("x1", "x2", "x3")], y = rfRndTrain$y, mtry = 3, nodesize = 1)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.5997398
## % Var explained: 99.77
# Plot evolution of R-squared
tibble::tibble(rsq=rfRndModel$rsq, ntree=1:length(rsq)) %>%
ggplot(aes(x=ntree, y=rsq)) +
geom_point() +
ylim(c(NA, 1)) +
labs(x="# trees", y="R-squared", title="mtry=3, nodesize=1, only includes x1, x2, x3")
# Plot variable importance
rfRndModel$importance %>%
as.data.frame() %>%
rownames_to_column("variable") %>%
ggplot(aes(x=fct_reorder(variable, IncNodePurity), y=IncNodePurity)) +
geom_col(fill="lightblue") +
coord_flip() +
labs(x="", title="mtry=3, nodesize=1, only includes x1, x2, x3")
# Assess performance on training data
rfRndOut <- rfRndOut %>%
mutate(predRF=predict(rfRndModel, newdata=.),
errRF=predRF-y
)
# Show RMSE for rfRndOut
rfRndOut %>%
group_by(x6) %>%
summarize(baseVar=var(y), rfVar=mean(errRF**2)) %>%
mutate(rfR2=1-rfVar/baseVar)
## # A tibble: 3 x 4
## x6 baseVar rfVar rfR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 0.634 0.998
## 2 b 216. 0.234 0.999
## 3 c 246. 0.553 0.998
Performance remains roughly the same, with R-squared of ~99.5% and the overwhelming majority of the importance placed on x1 and x2.
The process is converted to functional form so that it is easier to run:
# Helper function for running and reporting on random forest models
helperRFRandom <- function(dfData=rfRndTrain,
yVar="y",
xVar=c("x1", "x2", "x3", "x4", "x5", "x6"),
ntree=500,
mtry=max(1, floor(length(xVar)/3)),
nodesize=5,
dfErr=rfRndOut,
title=NULL
) {
# FUNCTION ARGUMENTS
# dfData: tibble or data frame containing inpuut data for model
# yVar: y variable for modeling
# xVar: x variable(s) for modeling
# ntree: number of trees to run
# mtry: Number of variables to consider at each split
# nodesize: size of terminal nodes
# dfErr: tibble or data frame containing test data and error estimated
# title: title for the plots
# Run the random forest model
rfRndModel <- randomForest::randomForest(y=dfData[, yVar, drop=TRUE],
x=dfData[, xVar],
ntree=ntree,
mtry=mtry,
nodesize=nodesize
)
# Assess accuracy on training dataset
print(rfRndModel)
# Plot evolution of R-squared
p1 <- tibble::tibble(rsq=rfRndModel$rsq, ntree=1:length(rsq)) %>%
ggplot(aes(x=ntree, y=rsq)) +
geom_point() +
ylim(c(NA, 1)) +
labs(x="# trees", y="R-squared", title=title)
print(p1)
# Plot variable importance
p2 <- rfRndModel$importance %>%
as.data.frame() %>%
rownames_to_column("variable") %>%
ggplot(aes(x=fct_reorder(variable, IncNodePurity), y=IncNodePurity)) +
geom_col(fill="lightblue") +
coord_flip() +
labs(x="", title=title)
print(p2)
# Assess performance on training data
dfErr <- dfErr %>%
mutate(predRF=predict(rfRndModel, newdata=.),
errRF=predRF-y
)
# Show RMSE for rfRndOut
dfErr %>%
group_by(x6) %>%
summarize(baseVar=var(y), rfVar=mean(errRF**2)) %>%
mutate(rfR2=1-rfVar/baseVar) %>%
print()
}
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x3"),
mtry=3, nodesize=1, ntree=100, title="y ~ f(x1, x2, x3); mtry=3, nodesize=1; ntree=100"
)
##
## Call:
## randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE], ntree = ntree, mtry = mtry, nodesize = nodesize)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.6119342
## % Var explained: 99.77
## # A tibble: 3 x 4
## x6 baseVar rfVar rfR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 0.598 0.998
## 2 b 216. 0.232 0.999
## 3 c 246. 0.507 0.998
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x3", "x4", "x5", "x6"),
mtry=6, nodesize=1, ntree=100,
title="y ~ f(x1, x2, x3, x4, x5, x6); mtry=6, nodesize=1; ntree=100"
)
##
## Call:
## randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE], ntree = ntree, mtry = mtry, nodesize = nodesize)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 0.8040267
## % Var explained: 99.69
## # A tibble: 3 x 4
## x6 baseVar rfVar rfR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 1.11 0.997
## 2 b 216. 0.315 0.999
## 3 c 246. 0.653 0.997
Allowing for all variables to be used at every split reduces x4 to near-zero explanatory power, as expected given that x1 should always be a better choice when both are available at a given split.
Suppose that the model is run once excluding each of the variables:
# Exclude x1
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x2", "x3", "x4", "x5", "x6"),
mtry=5, nodesize=1, ntree=250,
title="y ~ f(exclude x1); mtry=5, nodesize=1; ntree=250"
)
##
## Call:
## randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE], ntree = ntree, mtry = mtry, nodesize = nodesize)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 40.15434
## % Var explained: 84.65
## # A tibble: 3 x 4
## x6 baseVar rfVar rfR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 43.6 0.864
## 2 b 216. 34.8 0.839
## 3 c 246. 37.5 0.848
# Exclude x2
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x3", "x4", "x5", "x6"),
mtry=5, nodesize=1, ntree=250,
title="y ~ f(exclude x2); mtry=5, nodesize=1; ntree=250"
)
##
## Call:
## randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE], ntree = ntree, mtry = mtry, nodesize = nodesize)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 134.2149
## % Var explained: 48.7
## # A tibble: 3 x 4
## x6 baseVar rfVar rfR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 144. 0.551
## 2 b 216. 116. 0.462
## 3 c 246. 124. 0.496
# Exclude x3
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x4", "x5", "x6"),
mtry=5, nodesize=1, ntree=250,
title="y ~ f(exclude x3); mtry=5, nodesize=1; ntree=250"
)
##
## Call:
## randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE], ntree = ntree, mtry = mtry, nodesize = nodesize)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 1.933025
## % Var explained: 99.26
## # A tibble: 3 x 4
## x6 baseVar rfVar rfR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 2.32 0.993
## 2 b 216. 1.19 0.994
## 3 c 246. 1.76 0.993
# Exclude x4
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x3", "x5", "x6"),
mtry=5, nodesize=1, ntree=250,
title="y ~ f(exclude x4); mtry=5, nodesize=1; ntree=250"
)
##
## Call:
## randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE], ntree = ntree, mtry = mtry, nodesize = nodesize)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.7902814
## % Var explained: 99.7
## # A tibble: 3 x 4
## x6 baseVar rfVar rfR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 0.878 0.997
## 2 b 216. 0.320 0.999
## 3 c 246. 0.557 0.998
# Exclude x5
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x3", "x4", "x6"),
mtry=5, nodesize=1, ntree=250,
title="y ~ f(exclude x5); mtry=5, nodesize=1; ntree=250"
)
##
## Call:
## randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE], ntree = ntree, mtry = mtry, nodesize = nodesize)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.8016457
## % Var explained: 99.69
## # A tibble: 3 x 4
## x6 baseVar rfVar rfR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 1.06 0.997
## 2 b 216. 0.285 0.999
## 3 c 246. 0.701 0.997
# Exclude x6
helperRFRandom(rfRndTrain, yVar="y", xVar=c("x1", "x2", "x3", "x4", "x5"),
mtry=5, nodesize=1, ntree=250,
title="y ~ f(exclude x6); mtry=5, nodesize=1; ntree=250"
)
##
## Call:
## randomForest(x = dfData[, xVar], y = dfData[, yVar, drop = TRUE], ntree = ntree, mtry = mtry, nodesize = nodesize)
## Type of random forest: regression
## Number of trees: 250
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.8493107
## % Var explained: 99.68
## # A tibble: 3 x 4
## x6 baseVar rfVar rfR2
## <fct> <dbl> <dbl> <dbl>
## 1 a 321. 1.05 0.997
## 2 b 216. 0.256 0.999
## 3 c 246. 0.675 0.997
Broadly, results are as expected:
A look at extrapolation can also be informative. Suppose the formulae remain the same but there has been a 40% expansion of all data in rfTest:
rfRndExtrap <- rfRndTest %>%
mutate(x1=round(1.4*x1), x2=round(1.4*x2), x3=1.4*x3, x4=1.4*x4, x5=1.4*x5, y=x1*x2+x3)
rfRndExtrap %>%
mutate(pred=predict(rfRndModel, newdata=.)) %>%
ggplot(aes(x=y, y=pred)) +
geom_point(alpha=0.2) +
geom_smooth() +
geom_abline(color="red", lty=2) +
labs(x="Actual y", y="Predicted y", title="Random forest extrapolation when all variables scaled by 40%")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
As the random forest model starts encountering x1 and x2 that are “out of bounds” from the training set, the predictions become significantly worse. The random forest model has not learned anything mathematical that can be extrapolated; instead, it has learned the relationships of the features specific to the training data bounds that were provided.
Contrast these findings with an extrapolated linear model:
rfRndExtrap %>%
mutate(pred=predict(lmRnd, newdata=.)) %>%
ggplot(aes(x=y, y=pred)) +
geom_point(alpha=0.2) +
geom_smooth() +
geom_abline(color="red", lty=2) +
labs(x="Actual y", y="Predicted y", title="Linear model extrapolation when all variables scaled by 40%")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The linear model has learned the mathematical relationship and can extrapolate outside the bounds of the training data. There is always risk to extrapolation, but the linear model is not subject to an artifical hard ceiling and floor as is the case for the random forest model.
Further, suppose that x3 were to become a more important variable, with sd=20 rather than the baseline sd=1:
rfRndExtrap <- rfRndTest %>%
mutate(x3=20*x3, y=x1*x2+x3)
rfRndExtrap %>%
mutate(pred=predict(rfRndModel, newdata=.)) %>%
ggplot(aes(x=y, y=pred)) +
geom_point(alpha=0.2) +
geom_smooth() +
geom_abline(color="red", lty=2) +
labs(x="Actual y", y="Predicted y", title="Random forest model extrapolation when x3 scaled by 20")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
rfRndExtrap %>%
mutate(pred=predict(lmRnd, newdata=.)) %>%
ggplot(aes(x=y, y=pred)) +
geom_point(alpha=0.2) +
geom_smooth() +
geom_abline(color="red", lty=2) +
labs(x="Actual y", y="Predicted y", title="Linear model extrapolation when x3 scaled by 20")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The random forest becomes quite inaccurate since it never needed to learn much about x3 to make good predictions. The linear model having learned that the coefficient for x3 is 1.00 handles a significant change in x3 while still making accurate predictions.
While the random forest model performs outstanding on the bias/variance tradeoff for a given dataset, it can become highly biased when extrapolated for input values that are not congruent with the training data bounds. This is driven in large part by the forced ceiling and floor; a random forest will never predict a category that it has not seen, nor can it predict a value above/below the range of values it has seen.